Methods and compositions for analysis of regulatory sequences

ABSTRACT

Methods for constructing arrays of regulatory sequences, and the arrays so obtained, are provided. Regulatory sequences for use on the arrays are isolated based on their accessibility in cellular chromatin. A number of methods for using the arrays are disclosed, including regulatory DNA profiling, epigenome profiling, toxicological profiling and identification of in vivo binding sites of DNA binding proteins in complex genomes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional application no. 60/426,934, filed Nov. 15, 2002, which application is hereby incorporated by reference in its entirety herein.

TECHNICAL FIELD

The present disclosure is in the field of bioinformatics, gene regulation, gene regulatory sequences, gene regulatory proteins, methods of characterizing cells according to their spectra of regulatory DNA sequences, and microarray technology.

BACKGROUND

Through a concerted worldwide effort, significant progress has been made in determining the number and location of all human genes. Current estimates suggest that there are approximately 35,000 genes in the human genome. However, in order for this knowledge about human genes to be truly useful for biological research and biomedical applications, both the location and activity of regulatory sites in the DNA that control expression of these genes must be determined. To date, relatively little progress has been made in determining the sequence and/or location of this “regulatory DNA,” that is, the regions of the genome that are responsible for controlling gene expression. Current efforts aimed at identifying human regulatory DNA are limited to informatics approaches, for example algorithms that attempt to discern regulatory DNA from so-called “junk” DNA using cross-species comparisons as a basis for assessment. These bioinformatic methods have yielded very limited information and have not allowed for accurate and complete identification of all regulatory DNA. Other methods for localization of regulatory sequences, such as analysis of nuclease hypersensitive sites in cellular chromatin, destroy regulatory DNA in the process of identifying it, thereby precluding the isolation and sequence determination of these regulatory sequences.

Similarly, for regulatory sequences that have been identified, there are no methods for determining whether a set of regulatory sequences is active or inactive in a particular cell or tissue type. Thus, there remains a need for compositions and methods for determining the position and/or sequence and/or activity of nucleotide sequences in the human genome that perform transcriptional regulatory functions.

Moreover, transcriptional regulatory networks in the human genome are mapped at present on a gene-by-gene basis, and no massively parallel mapping strategy exists. Attempts have been made to use genome-wide expression profiling for this purpose, but even studies conducted on the relatively simple yeast genome have demonstrated that using this approach by itself reveals transcriptional phenotype, not the underlying transcriptional program. Giaever et al. (2002) Nature 418:387-391; Birrell et al. (2002) Proc. Natl. Acad. Sci. USA 99:8778-8783; Kozlova et al. (2000) Trends Endocrinol Metab 11:276-280; Nal et al. (2001) Bioessays 23:473-476; Pilpel et al.

Accordingly, there is a need for methods and compositions for integrating data obtained from the following studies: comparison of a cell's transcriptional profile under normal and “diseased” conditions; computational analysis of regulatory DNA of genes that become deregulated during disease; and/or experimental genome-wide analysis of transcription factor binding in vivo.

SUMMARY

Described herein are methods for the use of libraries of regulatory sequences obtained based on accessibility of nucleotide sequences in cellular chromatin. In particular, sequences obtained from such libraries are placed on one or more nucleic acid arrays (e.g., a microarray). Such arrays of regulatory sequences can be used for a number of purposes including, for example: characterizing the distribution of binding sites in a cellular genome for a given regulatory molecule; determination of the nature, location and sequence of active regulatory sequences in a cellular genome; determination of whether chromatin modification (e.g., covalent histone modifications such as methylation, acetylation and/or phosphorylation) has occurred at one or more regulatory sequences in a cellular genome; determination of the effects of compounds (e.g., toxins, organic molecules) on the preceding three processes; determination of the presence of single-nucleotide polymorphisms (SNPs) or haplotypes in a regulatory sequence in a cell; and identification of templates for microRNAs.

The methods generally involve obtaining a collection of accessible sequences, constructing an array (e.g., microarray) comprising the accessible sequences and using one or more of the arrays for hybridization to a collection of polynucleotide sequences. Use of these microarrays (also referred to as “regDNA chips”) allows any research group to rapidly determine how regulatory DNA sites are used in any cell or tissue.

In one aspect, a method for making an array is provided, the method comprising: (a) isolating a plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin; and (b) attaching each of the isolated sequences to an address on a solid support.

In another aspect, provided herein is an array comprising a plurality of accessible polynucleotide sequences, wherein: (a) the sequences are isolated based on their accessibility in cellular chromatin; and (b) each accessible sequence is located at a distinct address on a solid support. In certain embodiments, the accessible sequences are isolated from a plurality of different cell types from an organism.

In certain additional embodiments, the accessible sequences are isolated from a single cell or tissue type from an organism. In further embodiments, the accessible sequences may be isolated, for example, by (a) isolating a first plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin from a first cell; (b) isolating a second plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin from a second cell; (c) obtaining sequences that are unique to either the first or second plurality of cellular polynucleotide sequences; and (d) attaching each of the isolated sequences obtained in step (c) to an address on a solid support.
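Step (c) of the preceding embodiment, obtaining sequences unique to either the first or the second plurality, can be pictured as a symmetric set difference once each isolated fragment has been assigned an identifier. The following Python sketch is illustrative only: the function name and clone identifiers are hypothetical, and the text does not specify how uniqueness is determined in practice.

```python
def unique_to_either(first_cell_seqs, second_cell_seqs):
    """Return identifiers present in exactly one of the two collections.

    Assumption: each accessible sequence has already been reduced to a
    hashable identifier (e.g., a clone name or a normalized sequence string).
    """
    return set(first_cell_seqs) ^ set(second_cell_seqs)


# Hypothetical clone identifiers, for illustration only.
cell_1 = {"clone_A", "clone_B", "clone_C"}
cell_2 = {"clone_B", "clone_C", "clone_D"}

# Sequences accessible in only one of the two cells.
differential = unique_to_either(cell_1, cell_2)
```

Under this view, sequences accessible in both cells drop out, and only the cell-specific sequences remain as candidates for attachment to the support.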

In another aspect, provided herein is a method of identifying a target sequence bound by a DNA-binding protein, the method comprising the steps of: (a) contacting at least one DNA-binding protein with one or more of the arrays described herein, under conditions such that the protein binds to accessible sequences comprising a target sequence bound by the protein; (b) removing unbound proteins; and (c) identifying the accessible sequences bound by the protein, thereby identifying target sequences for the protein. Optionally, the protein can be labeled with a detectable label.

In yet another aspect, provided herein is a method of identifying a transcription factor, the method comprising the steps of: (a) preparing a preparation of proteins from a cell; (b) contacting the isolated proteins with one or more of the arrays described herein, under conditions such that transcription factors in the protein preparation bind to accessible sequences comprising a target sequence bound by a transcription factor; (c) removing unbound proteins; and (d) identifying the proteins bound to the array. Optionally, the protein can be labeled with a detectable label.

In a still further aspect, provided herein is a method for obtaining a regulatory profile of accessible sequences in a cell, the method comprising: (a) isolating a plurality of polynucleotide sequences from the cell, whereby the sequences are isolated based on their accessibility in cellular chromatin; (b) optionally amplifying the sequences obtained in step (a); (c) optionally labeling the sequences of step (a) or (b); (d) contacting the sequences of step (a), (b) or (c) with one or more of the arrays described herein; and (e) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell.

In yet another aspect, provided herein is a method for identifying functional binding sites for a DNA-binding protein in a cell, the method comprising: (a) subjecting a cell to conditions under which DNA-binding proteins are crosslinked to their binding sites in cellular chromatin; (b) shearing the crosslinked cellular chromatin of step (a); (c) immunoprecipitating the sheared crosslinked chromatin of step (b) with an antibody which recognizes the DNA-binding protein; (d) reversing the crosslinks in the immunoprecipitate of step (c); (e) purifying the DNA from the immunoprecipitated material of step (d); (f) optionally amplifying the DNA obtained in step (e); (g) optionally labeling the DNA of step (e) or (f); (h) contacting the DNA from step (e), (f) or (g) with one or more of the arrays described herein; and (i) identifying the accessible sequences bound on the array, thereby identifying functional binding sites for the DNA-binding protein in the cell.

In a still further aspect, provided herein is a method of identifying a sequence in cellular chromatin, wherein the chromatin is covalently modified, the method comprising: (a) providing a sample of cellular chromatin; (b) optionally subjecting the chromatin of step (a) to conditions under which DNA-binding proteins are crosslinked to their binding sites in cellular chromatin; (c) shearing the cellular chromatin of step (a) or (b); (d) immunoprecipitating the sheared chromatin of step (c) with an antibody which recognizes a covalent chromatin modification; (e) purifying the DNA from the immunoprecipitated material of step (d); (f) optionally amplifying the DNA obtained in step (e); (g) optionally labeling the DNA of step (e) or (f); (h) contacting the DNA from step (e), (f) or (g) with one or more of the arrays described herein; and (i) identifying the accessible sequences bound on the array, thereby identifying sequences in cellular chromatin wherein the chromatin is covalently modified. In any of these methods, the cellular chromatin may be, for example, in an isolated nucleus or collection of nuclei, or in a cell.

In yet another aspect, provided herein is a method for characterizing the effects of a molecule on a cell, the method comprising: (a) contacting the cell with the molecule; (b) isolating a first plurality of polynucleotide sequences from the cell of step (a), whereby the sequences are isolated based on their accessibility in cellular chromatin; (c) optionally amplifying the sequences obtained in step (b); (d) optionally labeling the sequences of step (b) or (c); (e) contacting the sequences of step (b), (c) or (d) with one or more of the arrays described herein; and (f) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell. In certain embodiments, the method further comprises the steps of: (g) providing cells that have not been contacted with the molecule; (h) isolating a second plurality of polynucleotide sequences from the cell of step (g), whereby the sequences are isolated based on their accessibility in cellular chromatin; (i) optionally amplifying the sequences obtained in step (h); (j) obtaining sequences that are unique to either the first or second plurality of polynucleotide sequences; (k) optionally amplifying the sequences obtained in step (j); (l) optionally labeling the sequences of step (j) or (k); (m) contacting the sequences of step (j), (k) or (l) with one or more of the arrays described herein; and (n) identifying the accessible sequences bound on the array, thereby identifying differences in accessible sequences between cells that have and have not been contacted with the molecule.

In a still further aspect, provided herein is a method of identifying single nucleotide polymorphisms (SNPs) in regulatory sequences of an individual, the method comprising the steps of: (a) preparing a library of regulatory DNA sequences from chromatin isolated from cells from the individual; (b) optionally labeling the sequences of step (a); (c) hybridizing the sequences of step (a) or (b) to an array described herein, under stringent hybridization conditions, wherein the regulatory DNA sequences of the library hybridize to complementary accessible sequences on the array; (d) removing regulatory DNA sequences of the library that are not bound to accessible sequences on the array; and (e) identifying accessible sequences on the array that are not hybridized to regulatory DNA sequences of the library, wherein the unbound accessible sequences on the array suggest the presence of a SNP in regulatory sequences of the individual corresponding to the unbound accessible sequence.

In any of the methods described herein, the DNA-binding protein may be, for example, a transcription factor, a hormone receptor (e.g., estrogen receptor), a replication factor or a recombination factor.
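Step (e) of the SNP method amounts to listing the array addresses that show no hybridization signal after washing. The Python sketch below illustrates that readout under stated assumptions: the per-address signal values, the threshold, and the function name are all hypothetical, and the text does not define how "not hybridized" is scored.

```python
def unhybridized_addresses(signals, threshold):
    """Return, in sorted order, array addresses whose signal is below threshold.

    signals   : dict mapping an array address to a measured signal (assumed).
    threshold : hypothetical cutoff below which an address counts as unbound.
    """
    return sorted(addr for addr, value in signals.items() if value < threshold)


# Hypothetical readout: address -> background-corrected signal.
readout = {"A1": 950.0, "A2": 12.0, "B1": 870.0, "B2": 4.5}

# Addresses flagged as candidate SNP-containing regulatory sequences.
candidates = unhybridized_addresses(readout, threshold=50.0)
```

An unbound address only suggests a polymorphism, as the text notes, so flagged addresses would be candidates for follow-up rather than confirmed SNPs.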

In yet another aspect, provided herein is a method for characterizing the effects of a stimulus on a cell, the method comprising: (a) subjecting the cell to the stimulus; (b) isolating a first plurality of polynucleotide sequences from the cell of step (a), whereby the sequences are isolated based on their accessibility in cellular chromatin; (c) optionally amplifying the sequences obtained in step (b); (d) optionally labeling the sequences of step (b) or (c); (e) contacting the sequences of step (b), (c) or (d) with one or more of the arrays described herein; and (f) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell. In certain embodiments, the method further comprises the steps of: (g) providing cells that have not been subjected to the stimulus; (h) isolating a second plurality of polynucleotide sequences from the cell of step (g), whereby the sequences are isolated based on their accessibility in cellular chromatin; (i) optionally amplifying the sequences obtained in step (h); (j) obtaining sequences that are unique to either the first or second plurality of polynucleotide sequences; (k) optionally amplifying the sequences obtained in step (j); (l) optionally labeling the sequences of step (j) or (k); (m) contacting the sequences of step (j), (k) or (l) with one or more of the arrays described herein; and (n) identifying the accessible sequences bound on the array, thereby identifying differences in accessible sequences between cells that have and have not been subjected to the stimulus. The stimulus may be, for example, disease state, infection, exposure to one or more drugs, stress, exposure to toxins, and combinations thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic depicting an exemplary transcriptional regulatory circuit.

FIG. 2, panels A-D, are blots depicting the location of DNase I hypersensitive sites in vivo using clones isolated from a library of regulatory DNAs as probes. In each panel, the left lane is a control (no DNase I); the middle lanes contain DNA from nuclei treated with DNase I (increasing concentrations of DNase I indicated by the height of the wedge), and the right lane (“M”) contains a marker. The location of the hypersensitive site is indicated by a triple line; the location of the regulatory DNA clone, determined by comparison of the marker lane (labeled “M”) with additional molecular weight markers (not shown), is indicated by the horizontal arrowhead. In Panel A, the horizontal arrowhead marks the clone location at the transcription start site of the gene HSPC142 on chromosome 19. The horizontal arrowhead in Panel B depicts the clone location two kb upstream of the transcription start site of PP5395 on chromosome 10. In Panel C, the horizontal arrowhead marks the clone location sixteen kb upstream of the transcription start site of UPK3 on chromosome 22 and in Panel D, the clone is located twenty-five kb downstream of the transcription start site of SART1.

Vertical arrows in panels A and B represent portions of the transcribed region of the gene, with the tail of the arrow corresponding to the transcription start site.

FIG. 3, Panels A and B, are pie graphs depicting regulatory DNA library clone distribution (Panel A) and distribution of DNA in the genome (Panel B). Panel A depicts the location of 405 clones from a HEK 293 regulatory DNA library. Panel B depicts the expected distribution if the library contained randomly isolated 500 bp fragments from the genome.

FIG. 4 is a graph depicting mouse-human evolutionary conservation score using a nonpromoter clone from the regulatory DNA library (location on the genome indicated by the black bar at top center). The chromosomal sequence depicted includes a stretch of human chromosome 22 containing the transcription start site of the OLIG2 gene. The grayscale graph shows mouse-human sequence conservation across this region (the height of the peak corresponds to the degree of conservation). The core promoter is located at the peak on the right indicated by the arrow 1 beneath the graph. A small peak of mouse-human conservation (indicated by the number 2 beneath the graph) precisely coincides with the location of the clone from the regulatory DNA library (black bar above the graph in center).

FIG. 5 is a schematic flowchart depicting steps used in constructing an array to map intergenic yeast regions. The first three steps are essentially chromatin immunoprecipitation (ChIP). Unlike in humans, regulatory regions in yeast are intergenic. Accordingly, in yeast, the products of chromatin immunoprecipitation can be directly assessed using microarrays of yeast intergenic regions.

FIG. 6 is a flowchart depicting various steps used to assess regulatory DNA.

DETAILED DESCRIPTION

The ability to isolate and identify human regulatory DNA on a genome-wide scale is a unique capability. The construction of microarrays comprising a plurality of regulatory sequences, isolated by virtue of their accessibility in cellular chromatin, allows many types of analysis of cellular regulatory mechanisms, as described herein.

The practice of conventional techniques in molecular biology, biochemistry, chromatin structure and analysis, computational chemistry, cell culture, recombinant DNA, bioinformatics, genomics and related fields is well known to those of skill in the art and is discussed, for example, in the following literature references: Sambrook et al. MOLECULAR CLONING: A LABORATORY MANUAL, Second edition, Cold Spring Harbor Laboratory Press, 1989 and Third edition, 2001; Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, John Wiley & Sons, New York, 1987 and periodic updates; the series METHODS IN ENZYMOLOGY, Academic Press, San Diego; Wolffe, CHROMATIN STRUCTURE AND FUNCTION, Third edition, Academic Press, San Diego, 1998; METHODS IN ENZYMOLOGY, Vol. 304, “Chromatin” (P. M. Wassarman and A. P. Wolffe, eds.), Academic Press, San Diego, 1999; and METHODS IN MOLECULAR BIOLOGY, Vol. 119, “Chromatin Protocols” (P. B. Becker, ed.) Humana Press, Totowa, 1999, all of which are incorporated by reference in their entireties.

I. Definitions

The terms “nucleic acid,” “polynucleotide,” and “oligonucleotide” are used interchangeably and refer to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form. For the purposes of the present disclosure, these terms are not to be construed as limiting with respect to the length of a polymer. The terms can encompass known analogues of natural nucleotides, as well as nucleotides that are modified in the base, sugar and/or phosphate moieties. In general, an analogue of a particular nucleotide has the same base-pairing specificity; i.e., an analogue of A will base-pair with T. The terms also encompass nucleic acids containing modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides. Examples of such analogs include, without limitation, phosphorothioates, phosphoramidates, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs).

Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Nucleic acids include, for example, genes, cDNAs, and mRNAs. Polynucleotide sequences are displayed herein in the conventional 5′-3′ orientation.

The terms “polypeptide,” “peptide” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residue is an analog or mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers. Polypeptides can be modified, e.g., by the addition of carbohydrate residues to form glycoproteins. The terms “polypeptide,” “peptide” and “protein” include glycoproteins, as well as non-glycoproteins. The polypeptide sequences are displayed herein in the conventional N-terminal to C-terminal orientation.

Binding refers to an interaction between two molecules; e.g., between two proteins, between a protein and a small molecule (molecular weight < 10 kD) ligand, between a protein and a nucleic acid or between two single-stranded nucleic acids to form a nucleic acid duplex or “hybrid.” Binding can be covalent or non-covalent and can be specific or non-specific. Protein-nucleic acid binding and nucleic acid-nucleic acid binding are often sequence-specific, but are not necessarily so. Methods for determining sequence-specificity of binding interactions are known in the art.

Nucleotide sequence-specific binding between two single-stranded polynucleotides, mediated by base-pairing, to form a double-stranded polynucleotide, is known as “annealing,” “hybridization” or “renaturation.” One of the two single-stranded polynucleotides is sometimes referred to as a “hybridization probe” and the other a “target” nucleic acid. A probe nucleic acid is often labeled, by methods known in the art. In this way duplex polynucleotides formed by hybridization can be detected.

Conditions for hybridization are well-known to those of skill in the art. Hybridization stringency refers to the degree to which hybridization conditions disfavor the formation of hybrids containing mismatched nucleotides, with higher stringency correlated with a lower tolerance for mismatched hybrids. Factors that affect the stringency of hybridization are well-known to those of skill in the art and include, but are not limited to, temperature, pH, ionic strength, and concentration of organic solvents such as, for example, formamide and dimethylsulfoxide. As is known to those of skill in the art, hybridization stringency is increased by higher temperatures, lower ionic strength and lower solvent concentrations.

Stringency of hybridization can also be modulated by using certain nucleotide analogues or pendant groups in one and/or the other of the hybridization probe or target nucleic acid. See, for example, U.S. Pat. Nos. 5,801,155; 6,127,121; 6,312,894; 6,485,906; and 6,492,346; and Liu et al. (2003) Science 302:868-871.

With respect to stringency conditions for hybridization, it is well known in the art that numerous equivalent conditions can be employed to establish a particular stringency by varying, for example, the following factors: the length and nature of probe and target sequences, base composition of the various sequences, concentrations of salts and other hybridization solution components, the presence or absence of blocking agents in the hybridization solutions (e.g., dextran sulfate and polyethylene glycol), hybridization reaction temperature and time parameters, as well as varying wash conditions. The selection of a particular set of hybridization conditions is accomplished following standard methods in the art. See, for example, Sambrook, et al., Molecular Cloning: A Laboratory Manual, Second Edition, (1989) Cold Spring Harbor, N.Y.; Nucleic Acid Hybridization: A Practical Approach, editors B. D. Hames and S. J. Higgins, (1985) Oxford; Washington, D.C.; IRL Press.

A “binding protein” or “binding domain” is a protein or polypeptide that is able to bind covalently or non-covalently to another molecule. Non-covalent binding includes, but is not limited to, ionic bonding, hydrogen bonding, van der Waals interactions, hydrophobic interactions or any combination of the aforementioned. A binding protein can bind to, for example, a DNA molecule (a DNA-binding protein), an RNA molecule (an RNA-binding protein) and/or a protein molecule (a protein-binding protein). In the case of a protein-binding protein, it can bind to itself (to form homodimers, homotrimers, etc.) and/or it can bind to one or more molecules of a different protein or proteins. A binding protein can have more than one type of binding activity. For example, zinc finger proteins have DNA-binding, RNA-binding and protein-binding activity.

The interaction between a DNA-binding protein and its target sequence can be characterized by its affinity and by its specificity. Affinity refers to the strength of the binding interaction and can be expressed quantitatively as a dissociation constant (K_d). Specificity refers to the degree to which a binding protein binds more strongly to one sequence (e.g., its target sequence) than to another related sequence. High-affinity binding between, for example, a DNA-binding protein and a specific DNA target sequence is characterized by a dissociation constant of 1×10⁻⁶ M or lower.
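To make the dissociation constant concrete, the following Python sketch computes the fraction of a DNA target site occupied at equilibrium. It assumes simple 1:1 binding with the free protein concentration known, which is a textbook simplification and not a model stated in the text; the function names are illustrative. The 1×10⁻⁶ M threshold is the one given above.

```python
HIGH_AFFINITY_KD_M = 1e-6  # upper bound on K_d for "high-affinity" binding, per the text


def fraction_bound(free_protein_m, kd_m):
    """Fraction of target sites occupied, assuming simple 1:1 equilibrium binding.

    free_protein_m : free protein concentration in mol/L (assumed known).
    kd_m           : dissociation constant in mol/L.
    """
    if free_protein_m < 0 or kd_m <= 0:
        raise ValueError("concentration must be >= 0 and K_d must be > 0")
    return free_protein_m / (free_protein_m + kd_m)


def is_high_affinity(kd_m):
    """True when K_d is at or below the 1e-6 M threshold."""
    return kd_m <= HIGH_AFFINITY_KD_M


# When the free protein concentration equals K_d, half of the sites are occupied.
half_occupancy = fraction_bound(1e-6, 1e-6)
```

A lower K_d means that less protein is needed to reach the same occupancy, which is why lower values correspond to higher affinity.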

A “zinc finger binding protein” is a protein or polypeptide that binds DNA, RNA and/or protein, preferably in a sequence-specific manner, as a result of stabilization of protein structure through coordination of a zinc ion. The term zinc finger binding protein is often abbreviated as zinc finger protein or ZFP. The individual DNA binding domains are typically referred to as “fingers.” A ZFP has at least one finger, typically two fingers, three fingers, or six fingers. Each finger binds from two to four base pairs of DNA, typically three or four base pairs of DNA. A ZFP binds to a nucleic acid sequence called a target site or target segment. Each finger typically comprises an approximately 30 amino acid, zinc-chelating, DNA-binding subdomain. An exemplary motif characterizing one class of these proteins (C₂H₂ class) is -Cys-(X)₂₋₄-Cys-(X)₁₂-His-(X)₃₋₅-His (where X is any amino acid) (SEQ ID NO: 1). Studies have demonstrated that a single zinc finger of this class consists of an alpha helix containing the two invariant histidine residues coordinated with zinc along with the two cysteine residues of a single beta turn (see, e.g., Berg & Shi, Science 271:1081-1085 (1996)).

Zinc finger binding domains can be engineered to bind to a predetermined nucleotide sequence. Non-limiting examples of methods for engineering zinc finger proteins are design and selection.

A “designed” zinc finger protein is a protein not occurring in nature whose structure and composition result principally from rational criteria. Rational criteria for design include application of substitution rules and computerized algorithms for processing information in a database storing information of existing ZFP designs and binding data, for example as described in co-owned U.S. Pat. No. 6,453,242. See also U.S. Pat. Nos. 6,140,081 and 6,534,261 and WO 98/53058; WO 98/53059; WO 98/53060; WO 02/016536 and WO 03/016496. A “selected” zinc finger protein is a protein not found in nature whose production results primarily from an empirical process such as phage display, interaction trap or hybrid selection. See e.g., U.S. Pat. No. 5,789,538; U.S. Pat. No. 5,925,523; U.S. Pat. No. 6,007,988; U.S. Pat. No. 6,013,453; U.S. Pat. No. 6,200,759; WO 95/19431; WO 96/06166; WO 98/53057; WO 98/54311; WO 00/27878; WO 01/60970; WO 01/88197 and WO 02/099084.
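The exemplary C₂H₂ motif, -Cys-(X)₂₋₄-Cys-(X)₁₂-His-(X)₃₋₅-His, translates directly into a regular expression over one-letter amino acid codes. The Python sketch below is a minimal illustration of that translation; the function name and the test sequence are hypothetical, the sequence being constructed only to satisfy the spacing rules, and a pattern match alone does not establish that a peptide folds as a functional zinc finger.

```python
import re

# C, then 2-4 of any residue, C, 12 of any residue, H, 3-5 of any residue, H.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_motifs(protein_seq):
    """Return (start_index, matched_text) for each non-overlapping motif match.

    protein_seq : amino acid sequence in one-letter code, N- to C-terminal.
    """
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein_seq.upper())]


# Illustrative sequence chosen to fit the spacing Cys-X4-Cys-X12-His-X3-His.
example = "RPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
matches = find_c2h2_motifs(example)
```

Because `finditer` reports non-overlapping matches, tandem fingers in a longer protein would each be reported once, in order of position.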

A “target site” or “target sequence” is a sequence that is bound by a binding protein such as, for example, a ZFP. Target sequences can be nucleotide sequences (either DNA or RNA) or amino acid sequences. A single target site typically has about four to about ten base pairs. Typically, a two-fingered ZFP recognizes a four to seven base pair target site, a three-fingered ZFP recognizes a six to ten base pair target site, and a six-fingered ZFP recognizes two adjacent nine to ten base pair target sites. By way of example, a DNA target sequence for a three-finger ZFP is generally either 9 or 10 nucleotides in length, depending upon the presence and/or nature of cross-strand interactions between the ZFP and the target sequence.

Target sequences can be found in any DNA or RNA sequence, including regulatory sequences, exons, introns, or any non-coding sequence.

A “target subsite” or “subsite” is the portion of a DNA target site that is bound by a single zinc finger, excluding cross-strand interactions. Thus, in the absence of cross-strand interactions, a subsite is generally three nucleotides in length. In cases in which a cross-strand interaction occurs (e.g., a “D-able subsite,” as described, for example, in co-owned U.S. Pat. No. 6,453,242, incorporated by reference in its entirety herein), a subsite is four nucleotides in length and overlaps with another 3- or 4-nucleotide subsite.
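As a small worked illustration of the subsite definition, the Python sketch below divides a target site into consecutive three-nucleotide subsites, one per finger. It covers only the case without cross-strand interactions; the function name and example sequence are hypothetical, and the sketch makes no statement about which finger binds which subsite.

```python
def split_into_subsites(target_site, subsite_len=3):
    """Split a target site into consecutive, non-overlapping subsites.

    Assumption: no cross-strand interactions, so each finger contacts
    exactly subsite_len nucleotides (three by default, per the text).
    """
    if subsite_len <= 0 or len(target_site) % subsite_len != 0:
        raise ValueError("target length must be a positive multiple of subsite_len")
    return [target_site[i:i + subsite_len]
            for i in range(0, len(target_site), subsite_len)]


# A hypothetical 9-nucleotide target for a three-finger protein.
subsites = split_into_subsites("GCGTGGGCG")
```

With a D-able subsite, the four-nucleotide subsite overlaps its neighbor, so a simple non-overlapping split like this one would no longer apply.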

Chromatin is the nucleoprotein structure comprising the cellular genome. “Cellular chromatin” comprises nucleic acid, primarily DNA, and protein, including histones and non-histone chromosomal proteins. The majority of eukaryotic cellular chromatin exists in the form of nucleosomes, wherein a nucleosome core comprises approximately 150 base pairs of DNA associated with an octamer comprising two each of histones H2A, H2B, H3 and H4; and linker DNA (of variable length depending on the organism) extends between nucleosome cores. A molecule of histone H1 is generally associated with the linker DNA. For the purposes of the present disclosure, the term “chromatin” is meant to encompass all types of cellular nucleoprotein, both prokaryotic and eukaryotic. Cellular chromatin includes both chromosomal and episomal chromatin.

A “chromosome” is a chromatin complex comprising all or a portion of the genome of a cell. The genome of a cell is often characterized by its karyotype, which is the collection of all the chromosomes that comprise the genome of the cell. The genome of a cell can comprise one or more chromosomes.

An “episome” is a replicating nucleic acid, nucleoprotein complex or other structure comprising a nucleic acid that is not part of the chromosomal karyotype of a cell. Examples of episomes include plasmids and certain viral genomes.

An “accessible region” in cellular chromatin is generally one that does not have a typical nucleosomal structure. As such, an accessible region can be identified and localized by, for example, the use of chemicals and/or enzymes that probe chromatin structure. Accessible regions will, in general, have an altered reactivity to a probe, compared to bulk chromatin. An accessible region may be sensitive to the probe, compared to bulk chromatin, or it may have a pattern of sensitivity that is different from the pattern of sensitivity exhibited by bulk chromatin. Accessible regions can be identified by any method known to those of skill in the art for probing chromatin structure.

In one embodiment, an enzymatic probe of chromatin structure is used to identify an accessible region. In a preferred embodiment, the enzymatic probe is DNase I (pancreatic deoxyribonuclease). Regions of cellular chromatin that exhibit enhanced sensitivity to digestion by DNase I, compared to bulk chromatin (i.e., DNase-hypersensitive sites) are more likely to have a structure that is favorable to the binding of an exogenous molecule, since the nucleosomal structure of bulk chromatin is generally less conducive to binding of an exogenous molecule. Furthermore, DNase-hypersensitive regions of chromatin often contain DNA sequences involved in the regulation of gene expression. Thus, binding of an exogenous molecule to a DNase-hypersensitive chromatin region is more likely to have an effect on gene regulation.

In a separate embodiment, micrococcal nuclease (MNase) is used as a probe of chromatin structure to identify an accessible region. MNase preferentially digests the linker DNA present between nucleosomes, compared to bulk chromatin. It is likely that such linker DNA sequences are more apt to be bound by an exogenous molecule than are sequences present in nucleosomal DNA, which is wrapped around a histone octamer.

Additional enzymatic probes of chromatin structure include, but are not limited to, exonuclease III, S1 nuclease, mung bean nuclease, DNA methyltransferases and restriction endonucleases. In addition, the method described by van Steensel et al. (2000) Nature Biotechnology 18:424-428 can be used to identify an accessible region.

Chemical probes of chromatin structure, useful in the identification of accessible regions, include, but are not limited to, hydroxyl radicals, methidiumpropyl-EDTA.Fe(II) (MPE) and crosslinkers such as psoralen. See, for example, Tullius et al. (1987) Meth. Enzymology, Vol. 155, (J. Abelson & M. Simon, eds.) Academic Press, San Diego, pp. 537-558; Cartwright et al. (1983) Proc. Natl. Acad. Sci. USA 80:3213-3217; Hertzberg et al. (1984) Biochemistry 23:3934-3945; and Wellinger et al. in Methods in Molecular Biology, Vol. 119 (P. Becker, ed.) Humana Press, Totowa, N.J., pp. 161-173.

It will be clear that the aforementioned “probes of chromatin structure” are distinct from the “hybridization probes” also disclosed herein, and the differences will be apparent to one of skill in the art.

A “gene,” for the purposes of the present disclosure, includes a DNA region encoding a gene product, as well as all DNA regions that regulate the production of the gene product, whether or not such regulatory sequences are adjacent to coding and/or transcribed sequences. Accordingly, a gene includes, but is not necessarily limited to, promoter sequences, terminators, translational regulatory sequences such as ribosome binding sites and internal ribosome entry sites, enhancers, silencers, insulators, boundary elements, replication origins, matrix attachment sites and locus control regions.

“Gene expression” refers to the conversion of the information contained in a gene into a gene product. A gene product can be the direct transcriptional product of a gene (e.g., mRNA, tRNA, rRNA, antisense RNA, ribozyme, structural RNA or any other type of RNA) or a protein produced by translation of an mRNA. Gene products also include RNAs that are modified, by processes such as capping, polyadenylation, methylation, and editing, and proteins modified by, for example, methylation, acetylation, phosphorylation, ubiquitination, ADP-ribosylation, myristoylation, and glycosylation.

“Gene activation” and “augmentation of gene expression” refer to any process that results in an increase in production of a gene product. A gene product can be either RNA (including, but not limited to, mRNA, rRNA, tRNA, and structural RNA) or protein. Accordingly, gene activation includes those processes that increase transcription of a gene and/or translation of an mRNA. Examples of gene activation processes which increase transcription include, but are not limited to, those which facilitate formation of a transcription initiation complex, those which increase transcription initiation rate, those which increase transcription elongation rate, those which increase processivity of transcription and those which relieve transcriptional repression (by, for example, blocking the binding of a transcriptional repressor). Gene activation can constitute, for example, inhibition of repression as well as stimulation of expression above an existing level. Examples of gene activation processes which increase translation include those which increase translational initiation, those which increase translational elongation and those which increase mRNA stability. In general, gene activation comprises any detectable increase in the production of a gene product, preferably an increase in production of a gene product by about 2-fold, more preferably from about 2- to about 5-fold or any integer therebetween, more preferably between about 5- and about 10-fold or any integer therebetween, more preferably between about 10- and about 20-fold or any integer therebetween, still more preferably between about 20- and about 50-fold or any integer therebetween, more preferably between about 50- and about 100-fold or any integer therebetween, more preferably 100-fold or more.

“Gene repression” and “inhibition of gene expression” refer to any process that results in a decrease in production of a gene product. A gene product can be either RNA (including, but not limited to, mRNA, rRNA, tRNA, and structural RNA) or protein. Accordingly, gene repression includes those processes that decrease transcription of a gene and/or translation of an mRNA. Examples of gene repression processes which decrease transcription include, but are not limited to, those which inhibit formation of a transcription initiation complex, those which decrease transcription initiation rate, those which decrease transcription elongation rate, those which decrease processivity of transcription and those which antagonize transcriptional activation (by, for example, blocking the binding of a transcriptional activator). Gene repression can constitute, for example, prevention of activation as well as inhibition of expression below an existing level. Examples of gene repression processes that decrease translation include those that decrease translational initiation, those that decrease translational elongation and those that decrease mRNA stability. Transcriptional repression includes both reversible and irreversible inactivation of gene transcription. In general, gene repression comprises any detectable decrease in the production of a gene product, preferably a decrease in production of a gene product by about 2-fold, more preferably from about 2- to about 5-fold or any integer therebetween, more preferably between about 5- and about 10-fold or any integer therebetween, more preferably between about 10- and about 20-fold or any integer therebetween, still more preferably between about 20- and about 50-fold or any integer therebetween, more preferably between about 50- and about 100-fold or any integer therebetween, more preferably 100-fold or more. Most preferably, gene repression results in complete inhibition of gene expression, such that no gene product is detectable.
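The fold-change language in the two definitions above can be illustrated with a short calculation. The following Python sketch is illustrative only: the function names, the use of a treated/control ratio, and the default 2-fold cutoff (taken from the "preferably about 2-fold" language) are assumptions made for this example and are not part of the disclosed methods, which treat any detectable change as activation or repression.

```python
def fold_change(treated: float, control: float) -> float:
    """Ratio of gene product in the treated sample to that in the control.

    Values above 1 indicate an increase; values below 1 indicate a decrease.
    """
    if control <= 0 or treated < 0:
        raise ValueError("control must be positive and treated non-negative")
    return treated / control


def classify_change(treated: float, control: float, threshold: float = 2.0) -> str:
    """Label a change using a hypothetical fold-change threshold.

    The 2-fold default is an assumption for illustration; it mirrors the
    first preferred tier recited in the definitions.
    """
    fc = fold_change(treated, control)
    if fc >= threshold:
        return "activation"
    if fc <= 1.0 / threshold:
        return "repression"
    return "unchanged"


if __name__ == "__main__":
    print(fold_change(10.0, 2.0), classify_change(10.0, 2.0))
    print(fold_change(1.0, 4.0), classify_change(1.0, 4.0))
```

A treated level of zero yields a fold change of zero, which corresponds to the "most preferred" case of complete inhibition in which no gene product is detectable.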

The term “modulate” refers to a change in the quantity, degree or extent of a function. For example, the modified zinc finger-nucleotide binding polypeptides disclosed herein may modulate the activity of a promoter sequence by binding to a motif within the promoter, thereby inducing, enhancing or suppressing transcription of a gene operatively linked to the promoter sequence. Alternatively, modulation may include inhibition of transcription of a gene wherein the modified zinc finger-nucleotide binding polypeptide binds to the structural gene and blocks DNA-dependent RNA polymerase from reading through the gene, thus inhibiting transcription of the gene. The structural gene may be a normal cellular gene or an oncogene, for example. Alternatively, modulation may include inhibition of translation of a transcript. Thus, “modulation” of gene expression includes both gene activation and gene repression.

Modulation can be assayed by determining any parameter that is indirectly or directly affected by the expression of the target gene. Such parameters include, e.g., changes in RNA or protein levels; changes in protein activity; changes in product levels; changes in downstream gene expression; changes in transcription or activity of reporter genes such as, for example, luciferase, CAT, beta-galactosidase, or GFP (see, e.g., Misteli & Spector (1997) Nature Biotechnology 15:961-964); changes in signal transduction; changes in phosphorylation and dephosphorylation; changes in receptor-ligand interactions; changes in concentrations of second messengers such as, for example, cGMP, cAMP, IP₃, and Ca²⁺; changes in cell growth; changes in neovascularization; and/or changes in any functional effect of gene expression. Measurements can be made in vitro, in vivo, and/or ex vivo. Such functional effects can be measured by conventional methods, e.g., measurement of RNA or protein levels, measurement of RNA stability, and/or identification of downstream or reporter gene expression. Readout can be by way of, for example, chemiluminescence, fluorescence, colorimetric reactions, antibody binding, inducible markers, ligand binding assays; changes in intracellular second messengers such as cGMP and inositol triphosphate (IP₃); changes in intracellular calcium levels; cytokine release, and the like.

Accordingly, the terms “modulating expression,” “inhibiting expression” and “activating expression” of a gene can refer to the ability of a molecule to activate or inhibit transcription of a gene. Activation includes prevention of transcriptional inhibition (i.e., prevention of repression of gene expression) and inhibition includes prevention of transcriptional activation (i.e., prevention of gene activation).

A “functional fragment” of a protein, polypeptide or nucleic acid is a protein, polypeptide or nucleic acid whose sequence is not identical to the full-length protein, polypeptide or nucleic acid, yet retains the same function as the full-length protein, polypeptide or nucleic acid. A functional fragment can possess more, fewer, or the same number of residues as the corresponding native molecule, and/or can contain one or more amino acid or nucleotide substitutions. Methods for determining the function of a nucleic acid (e.g., coding function, ability to hybridize to another nucleic acid) are well-known in the art. Similarly, methods for determining protein function are well-known. For example, the DNA-binding function of a polypeptide can be determined by filter-binding, electrophoretic mobility-shift, or immunoprecipitation assays. See Ausubel et al., supra. The ability of a protein to interact with another protein can be determined, for example, by co-immunoprecipitation, two-hybrid assays or complementation, both genetic and biochemical. See, for example, Fields et al. (1989) Nature 340:245-246; U.S. Pat. No. 5,585,245 and PCT WO 98/44350.

A “fusion molecule” is a molecule in which two or more subunit molecules are linked, preferably covalently. The subunit molecules can be the same chemical type of molecule, or can be different chemical types of molecules. Examples of the first type of fusion molecule include, but are not limited to, fusion polypeptides (for example, a fusion between a ZFP DNA-binding domain and a transcriptional activation domain) and fusion nucleic acids (for example, a nucleic acid encoding the fusion polypeptide described herein). Examples of the second type of fusion molecule include, but are not limited to, a fusion between a triplex-forming nucleic acid and a polypeptide, and a fusion between a minor groove binder and a nucleic acid.

The term “heterologous” is a relative term, which when used with reference to portions of a nucleic acid indicates that the nucleic acid comprises two or more subsequences that are not found in the same relationship to each other in nature. For instance, a nucleic acid that is recombinantly produced typically has two or more sequences from unrelated genes synthetically arranged to make a new functional nucleic acid, e.g., a promoter from one source and a coding region from another source. The two nucleic acids are thus heterologous to each other in this context. When added to a cell, the recombinant nucleic acids would also be heterologous to the endogenous genes of the cell. Thus, in a chromosome, a heterologous nucleic acid would include a non-native (non-naturally occurring) nucleic acid that has integrated into the chromosome, or a non-native (non-naturally occurring) extrachromosomal nucleic acid. In contrast, a naturally translocated piece of chromosome would not be considered heterologous in the context of this patent application, as it comprises an endogenous nucleic acid sequence that is native to the mutated cell.

Similarly, a heterologous protein indicates that the protein comprises two or more subsequences that are not found in the same relationship to each other in nature (e.g., a “fusion protein,” where the two subsequences are encoded by a single nucleic acid sequence). See, e.g., Ausubel, supra, for an introduction to recombinant techniques.

The term “recombinant,” when used with reference, e.g., to a cell, or nucleic acid, protein, or vector, indicates that the cell, nucleic acid, protein or vector has been modified by the introduction of a heterologous nucleic acid or protein or the alteration of a native nucleic acid or protein, or that the cell is derived from a cell so modified. Thus, for example, recombinant cells express genes that are not found within the native (naturally occurring) form of the cell or express a second copy of a native gene that is otherwise normally or abnormally expressed, under expressed or not expressed at all.

Nucleic acid or amino acid sequences are “operably linked” (or “operatively linked”) when placed into a functional relationship with one another. For instance, a promoter or enhancer is operably linked to a coding sequence if it regulates, or contributes to the modulation of, the transcription of the coding sequence. Operably linked DNA sequences are typically contiguous, and operably linked amino acid sequences are typically contiguous and in the same reading frame. However, since enhancers generally function when separated from the promoter by up to several kilobases or more and intronic sequences may be of variable lengths, some polynucleotide elements may be operably linked but not contiguous. Similarly, certain amino acid sequences that are non-contiguous in a primary polypeptide sequence may nonetheless be operably linked due to, for example, folding of a polypeptide chain.

With respect to fusion polypeptides, the terms “operatively linked” and “operably linked” can refer to the fact that each of the components performs the same function in linkage to the other component as it would if it were not so linked. For example, with respect to a fusion polypeptide in which a ZFP DNA-binding domain is fused to a transcriptional activation domain (or functional fragment thereof), the ZFP DNA-binding domain and the transcriptional activation domain (or functional fragment thereof) are in operative linkage if, in the fusion polypeptide, the ZFP DNA-binding domain portion is able to bind its target site and/or its binding site, while the transcriptional activation domain (or functional fragment thereof) is able to activate transcription.

An “expression vector” is a nucleic acid construct, generated recombinantly or synthetically, with a series of specified nucleic acid elements that permit transcription of a particular nucleic acid in a host cell, and optionally integration or replication of the expression vector in a host cell. The expression vector can be part of a plasmid, virus, or nucleic acid fragment, of viral or non-viral origin. Typically, the expression vector includes an “expression cassette,” which comprises a nucleic acid to be transcribed operably linked to a promoter. The term expression vector also encompasses naked DNA operably linked to a promoter.

“Eucaryotic cells” include, but are not limited to, fungal cells (such as yeast), plant cells, animal cells, mammalian cells and human cells.

The term “common,” when used in reference to two or more polynucleotide sequences being compared, refers to polynucleotides that (i) exhibit a selected percentage of sequence identity (as defined below, typically between 80-100% sequence identity) and/or (ii) are located in similar positions, relative to a gene of interest. Likewise, the term “unique,” when used in reference to two or more polynucleotide sequences being compared, refers to polynucleotides that (i) do not exhibit a selected percentage of sequence identity (as defined below, typically less than 80% sequence identity) and/or (ii) are located in one or more different positions relative to a gene of interest.

“Sequence similarity” refers to the percent similarity in base pair sequence (as determined by any suitable method) between two or more polynucleotide sequences. Two or more sequences can be anywhere from 0-100% similar, or any integer value therebetween. Furthermore, sequences are considered to exhibit “sequence identity” when they are at least about 80-85%, preferably at least about 85-90%, more preferably at least about 90-92%, more preferably at least about 93-95%, more preferably 96-98%, and most preferably at least about 98-100% sequence identity (including all integer values falling within these described ranges). These percent identities are, for example, relative to the claimed sequences, or other sequences, when the sequences obtained by the methods disclosed herein are used as the query sequence. Additionally, one of skill in the art can readily determine the proper search parameters to use for any given sequence in the programs described herein. For example, the search parameters may vary based on the size of the sequence in question. Thus, for example, in certain embodiments, the search is conducted based on the size of the isolated polynucleotide(s) corresponding to an accessible region. The isolated polynucleotide comprises X contiguous nucleotides and is compared to sequences of approximately the same length, preferably the same length. Exemplary fragment lengths include, but are not limited to, at least about 6-1000 contiguous nucleotides (or any integer therebetween), at least about 50-750 contiguous nucleotides (or any integer therebetween), about 100-300 contiguous nucleotides (or any integer therebetween), wherein such contiguous nucleotides can be derived from a larger sequence of contiguous nucleotides.

Techniques for determining nucleic acid and amino acid sequence similarity are known in the art. Typically, such techniques include determining the nucleotide sequence of, e.g., an accessible region of cellular chromatin, and comparing these sequences to a second nucleotide sequence. Genomic sequences can also be determined and compared in this fashion. In general, “identity” refers to an exact nucleotide-to-nucleotide or amino acid-to-amino acid correspondence of two polynucleotides or polypeptide sequences, respectively. Two or more sequences (polynucleotide or amino acid) can be compared by determining their “percent identity.” The percent identity of two sequences, whether nucleic acid or amino acid sequences, is the number of exact matches between two aligned sequences divided by the length of the shorter sequence and multiplied by 100. An approximate alignment for nucleic acid sequences is provided by the local homology algorithm of Smith and Waterman, Advances in Applied Mathematics 2:482-489 (1981). This algorithm can be applied to amino acid sequences by using the scoring matrix developed by Dayhoff, Atlas of Protein Sequences and Structure, M. O. Dayhoff ed., 5 suppl. 3:353-358, National Biomedical Research Foundation, Washington, D.C., USA, and normalized by Gribskov, Nucl. Acids Res. 14(6):6745-6763 (1986). An exemplary implementation of this algorithm to determine percent identity of a sequence is provided by the Genetics Computer Group (Madison, Wis.) in the “BestFit” utility application. The default parameters for this method are described in the Wisconsin Sequence Analysis Package Program Manual, Version 8 (1995) (available from Genetics Computer Group, Madison, Wis.). An additional method of establishing percent identity in the context of the present disclosure is to use the MPSRCH package of programs copyrighted by the University of Edinburgh, developed by John F. Collins and Shane S. Sturrok, and distributed by IntelliGenetics, Inc. (Mountain View, Calif.).
From this suite of packages the Smith-Waterman algorithm can be employed where default parameters are used for the scoring table (for example, gap open penalty of 12, gap extension penalty of one, and a gap of six). From the data generated the “Match” value reflects “sequence identity.” Other suitable programs for calculating the percent identity or similarity between sequences are generally known in the art; for example, another alignment program is BLAST, used with default parameters. For example, BLASTN and BLASTP can be used with the following default parameters: genetic code=standard; filter=none; strand=both; cutoff=60; expect=10; Matrix=BLOSUM62; Descriptions=50 sequences; sort by=HIGH SCORE; Databases=non-redundant, GenBank+EMBL+DDBJ+PDB+GenBank CDS translations+Swiss protein+Spupdate+PIR. Details of these programs can be found at the following internet address: http://www.ncbi.nlm.gov/cgi-bin/BLAST. When claiming sequences relative to sequences described herein, the range of desired degrees of sequence identity is approximately 80% to 100% and any integer value therebetween. Typically the percent identities between the disclosed sequences and the claimed sequences are at least 70-75%, preferably 80-82%, more preferably 85-90%, even more preferably 92%, still more preferably 95%, and most preferably 98% sequence identity to the reference sequence.
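The percent identity calculation defined above (exact matches between two aligned sequences, divided by the length of the shorter sequence, multiplied by 100) can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences have already been aligned (for example, by a Smith-Waterman implementation) and that gaps are written as "-". The function names and the 80% default cutoff for labeling a pair "common" or "unique" are assumptions made for this example; the cutoff applies only the identity criterion (i) of those definitions and ignores the positional criterion (ii).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Exact matches / length of the shorter (ungapped) sequence * 100.

    Both inputs must be rows of the same pairwise alignment, so they
    must have equal length once gap characters are included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aligned_a.upper(), aligned_b.upper()
    # Count alignment columns where both rows carry the same residue.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    # Sequence lengths are measured without gap characters.
    shorter = min(len(a.replace(gap, "")), len(b.replace(gap, "")))
    if shorter == 0:
        raise ValueError("each sequence needs at least one residue")
    return 100.0 * matches / shorter


def label_pair(aligned_a: str, aligned_b: str, cutoff: float = 80.0) -> str:
    """Label a pair 'common' or 'unique' by identity alone (assumed cutoff)."""
    return "common" if percent_identity(aligned_a, aligned_b) >= cutoff else "unique"


if __name__ == "__main__":
    print(percent_identity("ACGTACGT", "ACGTTCGT"))  # 7 matches over 8 residues
    print(label_pair("AAAA", "AATT"))
```

Because the denominator is the shorter sequence, a short fragment that matches perfectly within a longer sequence scores 100%, which suits the comparison of short accessible-region fragments against longer reference sequences.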

An “exogenous molecule” is a molecule that is not normally present in a cell, but can be introduced into a cell by one or more genetic, biochemical or other methods. Normal presence in the cell is determined with respect to the particular developmental stage and environmental conditions of the cell. Thus, for example, a molecule that is present only during embryonic development of muscle is an exogenous molecule with respect to an adult muscle cell. Similarly, a molecule induced by heat shock is an exogenous molecule with respect to a non-heat-shocked cell. An exogenous molecule can comprise, for example, a functioning version of a malfunctioning endogenous molecule or a malfunctioning version of a normally-functioning endogenous molecule. Thus, the term “exogenous regulatory molecule” refers to a molecule that can modulate gene expression in a target cell but which is not encoded by the cellular genome of the target cell.

An exogenous molecule can be, among other things, a small molecule (i.e., molecular weight less than 10 kD), such as is generated by a combinatorial chemistry process, or a macromolecule such as a protein, nucleic acid, carbohydrate, lipid, glycoprotein, lipoprotein, polysaccharide, any modified derivative of the above molecules, or any complex comprising one or more of the above molecules. Nucleic acids include DNA and RNA; can be single- or double-stranded; can be linear, branched or circular; and can be of any length. Nucleic acids include those capable of forming duplexes, as well as triplex-forming nucleic acids. See, for example, U.S. Pat. Nos. 5,176,996 and 5,422,251. Proteins include, but are not limited to, DNA-binding proteins, transcription factors, chromatin remodeling factors, methylated DNA binding proteins, polymerases, methylases, demethylases, acetylases, deacetylases, kinases, phosphatases, integrases, recombinases, ligases, topoisomerases, gyrases and helicases.

An exogenous molecule can be the same type of molecule as an endogenous molecule, e.g., protein or nucleic acid (i.e., an exogenous gene), providing it has a sequence that is different from an endogenous molecule. For example, an exogenous nucleic acid can comprise an infecting viral genome, a plasmid or episome introduced into a cell, or a chromosome that is not normally present in the cell. Methods for the introduction of exogenous molecules into cells are known to those of skill in the art and include, but are not limited to, lipid-mediated transfer (i.e., liposomes, including neutral and cationic lipids), electroporation, direct injection, cell fusion, particle bombardment, calcium phosphate co-precipitation, DEAE-dextran-mediated transfer and viral vector-mediated transfer.

By contrast, an “endogenous molecule” is one that is normally present in a particular cell at a particular developmental stage under particular environmental conditions. For example, an endogenous nucleic acid can comprise a chromosome, the genome of a mitochondrion, chloroplast or other organelle, or a naturally-occurring episomal nucleic acid.

Additional endogenous molecules can include proteins, for example, transcription factors and components of chromatin remodeling complexes.

Thus, an “endogenous cellular gene” refers to a gene that is native to a cell, which is in its normal genomic and chromatin context, and which is not heterologous to the cell. Such cellular genes include, e.g., animal genes, plant genes, bacterial genes, protozoal genes, fungal genes, mitochondrial genes, and chloroplastic genes.

An “endogenous gene” refers to a microbial or viral gene that is part of a naturally occurring microbial or viral genome in a microbially or virally infected cell. The microbial or viral genome can be extrachromosomal or integrated into the host chromosome. This term also encompasses endogenous cellular genes, as described above.

The term “naturally-occurring” is used to describe an object that can be found in nature, as distinct from being artificially produced by a human. Similarly, the term “non-naturally-occurring” refers to an object or composition not found in nature.

II. General Overview

Transcription control pathways underlie nearly every major transition in cell, tissue, and organ behavior that occurs during human development and disease. As shown in FIG. 1, transcriptional pathways contain three components: (i) an environmental or developmental stimulus, such as a rise in hormone concentration, or a particular form of cell-cell interaction; (ii) a set of transcription factors that respond to the stimulus (directly or indirectly, e.g., via a signaling cascade); (iii) a set of downstream target genes that these transcription factors control by engaging DNA sequences that lie within regulatory DNA elements of these genes, such as promoters and enhancers. Disruption of normal transcription pathways often results in disease or pathology; for example, aberrant function of transcription factors at these regulatory DNA stretches directly causes a considerable proportion of human disease, including, but not limited to, such diseases as cancer (e.g., breast, ovarian, uterine, prostate, leukemia, lymphoma, etc.); osteoporosis; and asthma.

The first and second components of transcriptional networks have been well studied. Indeed, to date over 2,000 different transcription factors have been identified. In addition, pharmaceutical compounds that specifically affect function of these transcription factors are widely used in clinical practice as therapies, and a great many more are currently undergoing clinical trials.

However, little has been learned about the third component of transcriptional regulatory networks, target genes and their regulatory regions. Thus, although the stimulus and transcription factors associated with many transcriptional networks (e.g., hormone response systems such as estrogen, glucocorticoid, vitamin D, thyroid hormone, progesterone, testosterone, and retinoic acid; cell cycle systems involving transcription factors such as myc, fos, jun, pRb, p53, E2F, etc.; and inflammation pathways such as those involving NF-κB) are known, very little is known about the downstream targets (e.g., genes). For example, the direct targets of the estrogen receptor, or of myc, are poorly defined. This lack of knowledge represents a major obstacle to making progress in developing novel, more effective small molecule compounds that correct the dysfunction of such networks in disease. Table 1 shows a general overview of some of the issues addressed, and technical barriers overcome, by the present disclosure.

By providing a collection of regulatory sequences active in a cell under a given set of conditions, the present disclosure allows those regulatory sequences to be associated with the gene(s) they regulate, thereby providing new information on the identity of genes whose transcription is regulated, e.g., by external stimuli, a particular transcription factor, etc.

TABLE 1

| Technical Issues | Targets | Current Practice | Technical Barriers | Solution/Approach |
| --- | --- | --- | --- | --- |
| Mapping regulatory DNA elements in the human genome | Experimental identification of all regulatory DNA elements in the human genome. | <20% regulatory DNA is identified | Regulatory DNA cannot be comprehensively identified by computation. No high-throughput experimental approach exists. | Massively parallel isolation and cloning procedure for all active regulatory DNA |
| Comprehensive identification of direct genomic targets for transcription factors | Single-step identification of in vivo binding sites for any factor | <5% after laborious approach | No high-throughput method available | Use regulatory DNA microarray for high throughput analysis of transcription factor binding |
| Comprehensive mapping of transcriptional networks and their misregulation in disease | Identify specific transcriptional regulatory pathway driving disease progression | Laborious gene-by-gene analysis | No existing means of uncovering shared regulatory pathways. | Use regulatory DNA microarray for massively parallel mapping of all regulatory DNA relevant to a particular circuit. |
| Identification of the genome's functional state in a given cell/tissue type. | Identify global transcription regulatory circuit defining cell phenotype | Genome-wide expression profiling. | No information on what controls gene expression | Use of regulatory DNA microarray to identify in a single step the subset of regulatory DNA active in a given cell type. |

III. Isolation of Regulatory Sequences

A. General

Regulatory sequences are estimated to occupy between 1 and 10% of the human genome. Approximately 80% of these regulatory DNA stretches have not been identified, largely because, unlike organisms like yeast, not all human regulatory regions occur via core promoter elements adjacent to genes (i.e., in intergenic regions of the genome). See, Wyrick et al. (2002) Curr. Opin Genet Dev 12:130-136; Nal et al. (2001) Bioessays 23:473-476. In yeast, regulatory sequences can be readily analyzed by direct mapping (Ren et al. (2000) Science 290:2306) and/or by examination of intergenic regions in response to a stimulus (Pilpel et al. (2001) Nat Genet 29:153-159). See, also, FIG. 5. However, such methods are currently inapplicable to the human genome, because for any given human gene, regulatory sequences are more complex, since they include not only core promoters but, in addition, may also include distal promoter(s), enhancer(s), insulator(s), silencer(s), boundary element(s), locus control region(s), polyA addition sites, sites involved in control of replication (e.g., replication origins), centromeres, telomeres, transcription termination sites, sites regulating chromosome structure, matrix/scaffold attachment region(s), etc. See, for example, Wingender et al. (1997) Nucleic Acids Res. 25:265-268. Moreover, these regulatory regions are typically relatively short (˜200 bp) and are dispersed widely through the genome. For instance, known regulatory elements that control β-globin gene expression include five separate approximately 200 bp sequences spread over 15,000 bp of the genome and 30,000 bp upstream of the gene's start site. In view of the complexity of human regulatory sequences, computational analysis of genome sequences in humans has not been able to identify regulatory DNA in the human genome. Pennacchio et al. (2001) Nat Rev Genet 2:100-109; Galas et al. (2001) Science 291:1257-1260.

The failure of computational methods to identify regulatory regions in the human genome indicates that a different, likely experimental, solution will be required. For example, sensitivity of accessible regions to nucleases such as DNAse I is a known property of eukaryotic regulatory DNA stretches. See, e.g., Elgin et al. (1988) J. Biol. Chem. 263:19259-19262; Gross et al. (1988) Ann Rev Biochem 57:159-197. The accessibility of DNA in chromatin refers to any property that distinguishes a particular region of DNA, in cellular chromatin, from bulk cellular DNA. See, for example, Wolffe “Chromatin: Structure and Function” 3rd Ed., Academic Press, San Diego, 1998 for a description of cellular chromatin. For example, an accessible sequence (or accessible region) can be one that is not packaged into nucleosomes, or can comprise DNA present in nucleosomal structures that are different from that of bulk nucleosomal DNA (e.g., nucleosomes comprising modified histones). An accessible region includes, but is not limited to, a site in chromatin at which an enzymatic (e.g., DNAse I) or chemical probe reacts, under conditions in which the probe does not react with similar sites in bulk chromatin. Such regions of chromatin can include, for example, a functional group of a nucleotide, in which case probe reaction can generate a modified nucleotide, or a phosphodiester bond between two nucleotides, in which case probe reaction can generate polynucleotide fragments or chromatin fragments. Depending on the cell type or individual, chromatin includes various regions that are more or less accessible. Accessible regions in cellular chromatin may also be “remodeled,” for example, following binding of non-histone proteins to chromatin that may cause localized changes in chromatin structure and confer a dramatic (often at least an order of magnitude), but highly localized (approximately 200 bp), increase in accessibility of the regulatory DNA region to nucleases, such as DNAse I, or restriction enzymes.
Increased accessibility to nucleases is commonly detected usingthe DNAse I hypersensitivity assay, which identifies the genomicposition of these regions, lnown as “DNAse I hypersensitive sites.” See,also, FIG. 2. Although regulatory sequences may be identified on thebasis of their accessibility in cellular chromatin, traditional methodsof identifying regulatory sequences based on such accessibility (e.g., alocus-by-locus analysis involving DNase treatment, Southern-blotting andindirect end-labeling) is exceedingly labor intensive—mapping allregulatory sequences in the genome of a cell would take approximately2,400 person/years using these approaches. Moreover, these methodsdestroy the regulatory sequences in the process of identifying them sothat, although a rough location of the regulatory sequence is obtained,its nucleotide sequence is not.

Unlike the aforementioned traditional mapping methods, the methods described herein allow for both isolation and characterization of regulatory regions, and allow the isolation of a plurality of regulatory sequences in a single experiment, without requiring knowledge of the functional properties of the sequences. In other words, regulatory regions are not just mapped, they are actually isolated (e.g., cloned) and, optionally, sequenced or otherwise characterized. See, also, International Publication WO 01/83732, incorporated herein by reference in its entirety. Once cloned, a collection of isolated regulatory sequences can be attached to an array and used in additional methods of assessing cellular regulatory processes.

B. Obtaining Marked or Modified Fragments

1. Generally

Certain methods for identifying accessible regions involve the use of an enzymatic probe that modifies DNA in chromatin. Modified regions, which comprise accessible sequences, are then identified and can be isolated. Such methods generally comprise the treatment of cellular chromatin with a chemical and/or enzymatic probe wherein the probe reacts with (e.g., binds to, covalently modifies or cleaves within) accessible sequences. The treated chromatin is optionally deproteinized and then fragmented to produce a mixture of polynucleotide fragments, wherein the mixture comprises fragments containing at least one site that has reacted with the probe (marked polynucleotide fragments) and fragments that have not reacted with the probe (unmarked polynucleotide fragments). Marked fragments are selected and correspond to accessible regions of cellular chromatin.

Fragmentation is achieved by any method of polynucleotide fragmentation known to those of skill in the art including, but not limited to, nuclease digestion (e.g., restriction enzymes, non-sequence-specific nucleases such as DNase I, micrococcal nuclease, S1 nuclease and mung bean nuclease), and physical methods such as shearing and sonication. Isolation is accomplished by any technique that allows for the selective purification of marked fragments from unmarked fragments (e.g., size or affinity separation techniques and/or purification on the basis of a physical property).

2. Methods with Enzymatic Probes

A variety of enzymatic probes can be used to identify accessible regions of chromatin. Suitable enzymatic probes in general include any enzyme that can react with one or more sites in an accessible region to, for example, modify a nucleotide within the region, thereby generating a modified product. The modification provides the basis for selection of marked polynucleotides and their separation from unmarked polynucleotides.

DNA methyltransferase enzymes (or simply methylases) are examples of one group of suitable enzymes. Of the naturally occurring nucleosides only thymidine contains a methyl group (at the 5-position of the pyrimidine ring). Bacterial and eukaryotic methylases generally add methyl groups to nucleosides other than thymidine, to form, for example, N⁶-methyladenosine and 5-methylcytidine.

Methods employing methylases generally involve contacting cellular chromatin with a DNA methylase such that accessible DNA sequences are methylated. The chromatin is optionally deproteinized and, in one embodiment, the resulting methylated DNA is subsequently treated with a methylation-sensitive nuclease to generate large fragments corresponding to accessible regions. Alternatively, or in addition, methylated chromatin or DNA is treated with a methylation-dependent nuclease (e.g., a restriction enzyme that does not cleave at its recognition sequence unless the recognition sequence is methylated) to generate small fragments comprising accessible regions and larger fragments whose boundaries comprise accessible regions. In yet another alternative, cellular chromatin is contacted with a methylase, optionally deproteinized, fragmented, and methylated DNA fragments selected using antibodies to methylated nucleotides or methylated DNA.

For example, in certain methods, the dam methylase (E. coli DNA adenine methylase), which methylates the N⁶ position of adenine residues in the sequence 5′-GATC-3′, is used. This enzyme is useful in the analysis of regulatory regions in eukaryotic cells because adenine methylation does not normally occur in eukaryotic cells. Other exemplary methylases include, but are not limited to, AluI methylase, BamHI methylase, ClaI methylase, EcoRI methylase, FnuDII methylase, HaeIII methylase, HhaI methylase, HpaII methylase, MspI methylase, PstI methylase, SssI methylase, TaqI methylase, dcm (Mec) methylase, EcoK methylase and Dnmt1 methylase. These and related enzymes are commercially available, for example, from New England BioLabs, Inc., Beverly, Mass.

Following methylase treatment, accessible regions are identified by distinguishing methylated from non-methylated DNA. Some methods involve generating fragments of DNA and then separating those fragments that include methylated nucleotides (i.e., marked fragments) from those fragments that are unmethylated (i.e., unmarked fragments). For example, in embodiments in which cellular chromatin is treated with dam methylase, methylated fragments can be isolated by affinity purification using antibodies to N⁶-methyl adenine. Bringmann et al. (1987) FEBS Lett. 213:309-315. Any affinity purification technique known in the art such as, for example, affinity chromatography using immobilized antibody, can be used.

Methylated accessible regions can also be selected and isolated based on their possession of methylated restriction sites that are resistant to cleavage by methylation-sensitive restriction enzymes. For example, subsequent to its methylation, cellular chromatin is deproteinized and subjected to the activity of a methylation-sensitive restriction enzyme. A methylation-sensitive enzyme refers to a restriction enzyme that does not cleave DNA (or cleaves DNA poorly) if one or more nucleotides in its recognition site are methylated. Exemplary enzymes of this type include MboI and DpnII, both of which digest DNA at the sequence 5′-GATC-3′ only if the A residue is unmethylated. (Note that this is the same sequence that is methylated by dam methylase.) Since both of these enzymes have four-nucleotide recognition sequences, they generate, on average, small fragments of non-methylated DNA. Methylated regions, corresponding to areas of chromatin originally accessible to the methylase, are resistant to digestion and can be isolated, for example, based on their larger size, or through affinity methods that recognize methylated DNA (e.g., antibodies to N⁶-methyl adenine, supra). Other methylation-sensitive enzymes include, but are not limited to, HpaII and ClaI. See, in addition, the New England BioLabs 2000-01 Catalogue & Technical Reference, esp. pages 220-221 and references cited therein.

In other embodiments, preferential cleavage of methylated DNA (obtained from cellular chromatin that has been methylated as described supra) by certain enzymes such as, for example, methylation-dependent restriction enzymes, generates small fragments, which can be separated from larger, unmethylated DNA fragments. For example, treatment of cellular chromatin with dam methylase, followed by deproteinization and digestion of methylated DNA with DpnI (which cleaves at the 4-nucleotide recognition sequence 5′-GATC-3′ only if the A residue is methylated) will generate relatively small fragments from methylated accessible regions. These can be isolated based on size or affinity procedures, as disclosed above. In addition, the larger fragments generated by this procedure comprise the distal portions and boundaries of accessible regions at their termini and can be isolated based on size. Another methylation-dependent enzyme, which cleaves at a sequence different from that recognized by DpnI, is McrBC. This enzyme, as well as additional methylation-dependent restriction enzymes, is disclosed in the New England BioLabs 2000-01 Catalog and Technical Reference.
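The complementary behavior of methylation-dependent and methylation-sensitive digestion at 5′-GATC-3′ sites can be illustrated with a minimal in-silico sketch. The function names, the toy sequence and the representation of methylation as a set of site positions are illustrative assumptions and are not part of the disclosed methods; the sketch treats DNA as a single strand and ignores partial digestion.

```python
import re

def gatc_sites(seq):
    """Return the 0-based start position of every GATC in seq."""
    return [m.start() for m in re.finditer("GATC", seq.upper())]

def cut(seq, positions):
    """Split seq at the given cut positions and return non-empty fragments."""
    bounds = [0] + sorted(positions) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

def dpn1_like_digest(seq, methylated):
    """Methylation-dependent model: cut GA^TC only at methylated sites."""
    return cut(seq, [p + 2 for p in gatc_sites(seq) if p in methylated])

def mbo1_like_digest(seq, methylated):
    """Methylation-sensitive model: cut ^GATC only at unmethylated sites."""
    return cut(seq, [p for p in gatc_sites(seq) if p not in methylated])

# Toy example: GATC sites at positions 4, 12 and 20; the first two are
# assumed to lie in an accessible region and to have been methylated.
seq = "AAAAGATCTTTTGATCCCCCGATCGGGG"
methylated = {4, 12}
print(dpn1_like_digest(seq, methylated))  # short fragments from the marked region
print(mbo1_like_digest(seq, methylated))  # marked region stays in one large fragment
```

In this toy case the methylation-dependent model releases a short internal fragment bounded by the two methylated sites, whereas the methylation-sensitive model leaves both methylated sites uncut within a single larger fragment.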

Additional enzymatic probes of chromatin structure, which can be used to identify accessible regions, include micrococcal nuclease, S1 nuclease, mung bean nuclease, and restriction endonucleases. In addition, the method described by van Steensel et al. (2000) Nature Biotechnol. 18:424-428 can be used to identify accessible regions.

3. Methods with Chemical Probes

Another option for marking accessible regions in chromatin is to use various chemical probes. In general, these chemical probes react with a functional group of one or more nucleotides within an accessible region to generate a modified or derivatized nucleotide. Following cleavage of chromatin according to the established methods described supra, fragments including one or more derivatized nucleotides can be separated from those fragments that do not include modified nucleotides.

A variety of different chemical probes can be utilized to modify DNA in accessible regions. In general, the size and reactivity of such probes should enable the probes to react with nucleotides located within accessible regions. Chemical modification of cellular chromatin in accessible regions can be accomplished by treatment of cellular chromatin with reagents such as dimethyl sulfate, hydrazine, potassium permanganate, and osmium tetroxide. Maxam et al. (1980) Meth. Enzymology, Vol. 65, (L. Grossman & K. Moldave, eds.) Academic Press, New York, pp. 499-560. Additional exemplary chemical modification reagents are the psoralens, which are capable of intercalation and crosslink formation in double-stranded DNA.

As noted supra, once cellular chromatin has been contacted with a chemical probe and the reactants allowed a sufficient period in which to react, the resulting modified chromatin is fragmented using various cleavage methods. Exemplary techniques include reaction with restriction enzymes, sonication and shearing methods. Following fragmentation, marked polynucleotides corresponding to accessible regions can be purified from unmarked polynucleotides. Purification can be based on affinity methods such as, for example, binding to antibodies specific for the product of modification.

In certain embodiments, chemical and enzymatic probes can be combined to generate marked fragments that can be purified from unmarked fragments.

4. Methods with Binding Molecules

In certain embodiments, a molecule which is capable of binding to an accessible region, but does not necessarily cleave or covalently modify DNA in the accessible region, can be used to identify and isolate accessible regions. Suitable molecules include, for example, minor groove binders (e.g., U.S. Pat. Nos. 5,998,140 and 6,090,947), and triplex-forming oligonucleotides (TFOs, U.S. Pat. Nos. 5,176,996 and 5,422,251). The molecule is contacted with cellular chromatin, the chromatin is optionally deproteinized, then fragmented, and fragments comprising the bound molecule are isolated, for example, by affinity techniques. Use of a TFO comprising poly-inosine (poly-I) will lead to minimal sequence specificity of triplex formation, thereby maximizing the probability of interaction with the greatest possible number of accessible sequences.

In a variation of one of the aforementioned methods, TFOs with covalently attached modifying groups are used. See, for example, U.S. Pat. No. 5,935,830. In this case, covalent modification of DNA occurs in the vicinity of the triplex-forming sequence. After optional deproteinization and fragmentation of treated chromatin, marked fragments are purified by, for example, affinity selection.

In another embodiment, cellular chromatin is contacted with a non-sequence-specific DNA-binding protein. The protein is optionally crosslinked to the chromatin. The chromatin is then fragmented, and the mixture of fragments is subjected to immunoprecipitation using an antibody directed against the non-sequence-specific DNA-binding protein. Fragments in the immunoprecipitate are enriched for accessible regions of cellular chromatin. Suitable non-sequence-specific DNA-binding proteins for use in this method include, but are not limited to, prokaryotic histone-like proteins such as the bacteriophage SP01 protein TF1 and prokaryotic HU/DBPII proteins. Greene et al. (1984) Proc. Natl. Acad. Sci. USA 81:7031-7035; Rouviere-Yaniv et al. (1977) Cold Spring Harbor Symp. Quant. Biol. 42:439-447; Kimura et al. (1983) J. Biol. Chem. 258:4007-4011; Tanaka et al. (1984) Nature 310:376-381. Additional non-sequence-specific DNA-binding proteins include, but are not limited to, proteins containing poly-arginine motifs and sequence-specific DNA-binding proteins that have been mutated so as to retain DNA-binding ability but lose their sequence specificity. An example of such a protein (in this case, a mutated restriction enzyme) is provided by Rice et al. (2000) Nucleic Acids Res. 28:3143-3150.

In yet another embodiment, a plurality of sequence-specific DNA binding proteins is used to identify accessible regions of cellular chromatin. For example, a mixture of sequence-specific DNA binding proteins of differing binding specificities is contacted with cellular chromatin, chromatin is fragmented and the mixture of fragments is immunoprecipitated using an antibody that recognizes a common epitope on the DNA binding proteins. The resulting immunoprecipitate is enriched in accessible sites corresponding to the collection of DNA binding sites recognized by the mixture of proteins. Depending on the completeness of sequences recognized by the mixture of proteins, the accessible immunoprecipitated sequences will be a subset or a complete representation of accessible sites.

In addition, synthetic DNA-binding proteins can be designed in which non-sequence-specific DNA-binding interactions (such as, for example, phosphate contacts) are maximized, while sequence-specific interactions (such as, for example, base contacts) are minimized. Certain zinc finger DNA-binding domains obtained by bacterial two-hybrid selection have a low degree of sequence specificity and can be useful in the aforementioned methods. Joung et al. (2000) Proc. Natl. Acad. Sci. USA 97:7382-7387; see esp. the “Group III” fingers described therein.

C. Selective/Limited Digestion Methods

1. Limited Nuclease Digestion

This approach generally involves treating nuclei or chromatin under controlled reaction conditions with a chemical and/or enzymatic probe such that small fragments of DNA are generated from accessible regions. The selective and limited digestion required can be achieved by controlling certain digestion parameters. Specifically, one typically limits the concentration of the probe to very low levels. The duration of the reaction and/or the temperature at which the reaction is conducted can also be regulated to control the extent of digestion to desired levels. More specifically, relatively short reaction times, low temperatures and low concentrations of probe can be utilized.

Any of a variety of nucleases can be used to conduct the limited digestion. Both non-sequence-specific endonucleases such as, for example, DNase I, S1 nuclease, and mung bean nuclease, and sequence-specific nucleases such as, for example, restriction enzymes, can be used.

A variety of different chemical probes can be utilized to cleave DNA in accessible regions. Specific examples of suitable chemical probes include, but are not limited to, hydroxyl radicals and methidiumpropyl-EDTA.Fe(II) (MPE). Chemical cleavage in accessible regions can also be accomplished by treatment of cellular chromatin with reagents such as dimethyl sulfate, hydrazine, potassium permanganate, and osmium tetroxide, followed by exposure to alkaline conditions (e.g., 1 M piperidine). See, for example, Tullius et al. (1987) Meth. Enzymology, Vol. 155, (J. Abelson & M. Simon, eds.) Academic Press, San Diego, pp. 537-558; Cartwright et al. (1983) Proc. Natl. Acad. Sci. USA 80:3213-3217; Hertzberg et al. (1984) Biochemistry 23:3934-3945; Wellinger et al. in Methods in Molecular Biology, Vol. 119 (P. Becker, ed.) Humana Press, Totowa, N.J., pp. 161-173; and Maxam et al. (1980) Meth. Enzymology, Vol. 65, (L. Grossman & K. Moldave, eds.) Academic Press, New York, pp. 499-560.

When using chemical probes, reaction conditions are adjusted so as to favor the generation of, on average, two sites of reaction per accessible region, thereby releasing relatively short DNA fragments from the accessible regions.

As with the previously-described methods, the resulting small fragments generated by the digestion process can be purified by size (e.g., gel electrophoresis, sedimentation, gel filtration), preferential solubility, or by procedures which result in the separation of naked nucleic acid (i.e., nucleic acids lacking histones) from bulk chromatin, thereby allowing the small fragments to be isolated and/or cloned, and/or subsequently analyzed by, for example, nucleotide sequencing.

In one embodiment of this method, nuclei are treated with low concentrations of DNase; DNA is then purified from the nuclei and subjected to gel electrophoresis. The gel is blotted and the blot is probed with a short, labeled fragment corresponding to a known mapped DNase hypersensitive site located, for example, in the promoter of a housekeeping gene. Examples of such genes (and associated hypersensitive sites) include, but are not limited to, those in the genes encoding rDNA, glyceraldehyde-3-phosphate dehydrogenase (GAPDH) and core histones (e.g., H2A, H2B, H3, H4). Alternatively, a DNA fragment size fraction is isolated from the gel, slot-blotted and probed with a hypersensitive site probe and a probe located several kilobases (kb) away from the hypersensitive site. Preferential hybridization of the hypersensitive site probe to the size fraction is indicative that the fraction is enriched in accessible region sequences. A size fraction enriched in accessible region sequences can be cloned, using standard procedures, to generate a library of accessible region sequences.

In certain embodiments, regulatory regions are obtained essentially as follows:

(i) isolate intact nuclei from any cell type;

(ii) digest genomic DNA within nuclei using selected restriction enzymes and/or nucleases (e.g., DNAse I), under conditions optimized to allow, on average, a single cleavage per accessible region;

(iii) deproteinize the DNA, preferably under conditions that avoid shearing (e.g., embedding nuclei in agarose);

(iv) shear deproteinized DNA to an average size of 500 bp, e.g., by digestion with a restriction enzyme that yields DNA fragments with defined cohesive ends under controlled conditions; and

(v) clone fragments with one end cleaved by the nuclease (from step ii) and the other end cleaved during shearing (step iv) from the resulting genomic DNA pool. Clones in the resulting library comprise regulatory DNA sequences active in the cell type used.
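The selection criterion of step (v), namely retaining only fragments bounded by one nuclease-generated end and one end generated in step (iv), can be sketched as follows. This is a minimal illustration that assumes cut positions are already known as coordinates on a single linear molecule; the function name and the example coordinates are hypothetical.

```python
def clonable_fragments(nuclease_cuts, secondary_cuts):
    """Return (start, end) intervals bounded by exactly one nuclease cut
    (step ii) and one secondary cut (step iv).

    Both arguments are iterables of integer coordinates on one molecule.
    """
    labeled = sorted(
        [(pos, "nuclease") for pos in nuclease_cuts]
        + [(pos, "secondary") for pos in secondary_cuts]
    )
    selected = []
    # Each pair of adjacent cuts defines one fragment of the digest.
    for (p1, kind1), (p2, kind2) in zip(labeled, labeled[1:]):
        if {kind1, kind2} == {"nuclease", "secondary"}:
            selected.append((p1, p2))
    return selected

# Hypothetical coordinates: one nuclease cut in an accessible region at 1000,
# and restriction sites at 400, 1300 and 2500.
print(clonable_fragments([1000], [400, 1300, 2500]))
# Fragments flanking the nuclease cut are kept; 1300-2500 is discarded.
```

Under these assumptions, every retained fragment abuts a site of nuclease action, which is why the resulting clones are expected to be enriched in accessible-region sequences.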

In certain embodiments, the regulatory DNA is prepared, in part, by exposing cell nuclei to DNAseI. Preferably, the exposure to DNAseI is conducted under conditions such that the DNAseI does not substantially cleave in non-accessible regions and under conditions such that the chromatin does not shear. See, also, Examples.

Micrococcal nuclease (MNase) is used as a probe of chromatin structure in other methods to identify accessible regions. MNase preferentially digests the linker DNA present between nucleosomes, compared to bulk chromatin. Regulatory sequences are often located in linker DNA, to facilitate their ability to be bound by transcriptional regulatory molecules. Consequently, digestion of chromatin with MNase preferentially digests regions of chromatin that often include regulatory sites. Because MNase digests DNA between nucleosomes, differences in nucleosome positioning on specific sequences, between different cells, can be revealed by analysis of MNase digests of cellular chromatin using techniques such as, for example, indirect end-labeling. Since alterations in nucleosome positioning are often associated with changes in gene regulation, sequences associated with changes in nucleosome positioning are likely to be regulatory sequences.

The borders of accessible regions can be localized, if necessary, utilizing the technique of indirect end-labeling. In this method, a collection of DNA fragments obtained as described above (i.e., reaction of nuclei or cellular chromatin with a probe or cleavage agent followed by deproteinization) is digested with a restriction enzyme to generate restriction fragments that include the regions of interest. Such fragments are then separated by gel electrophoresis and blotted onto a membrane. The membrane is then hybridized with a labeled hybridization probe complementary to a short region at one end of the restriction fragment containing the region of interest. In the absence of an accessible region, the hybridization probe identifies the full-length restriction fragment. However, if an accessible region is present within the sequences defined by the restriction fragment, the hybridization probe identifies one or more DNA species that are shorter than the restriction fragment. The size of each additional DNA species corresponds to the distance between an accessible region and the end of the restriction fragment to which the hybridization probe is complementary.
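The relationship between band size and the position of an accessible region in indirect end-labeling can be expressed as a short calculation. The sketch below assumes idealized, point-like cleavage sites and a probe of negligible length at one end of the restriction fragment; the function name and coordinates are illustrative only.

```python
def end_label_bands(frag_start, frag_end, cleavage_sites, probe_at_start=True):
    """Return (full_length, sub_band_sizes) expected on the blot.

    frag_start, frag_end: coordinates of the restriction fragment ends.
    cleavage_sites: coordinates of probe/nuclease cleavage in chromatin.
    probe_at_start: True if the hybridization probe abuts frag_start.
    """
    full_length = frag_end - frag_start
    anchor = frag_start if probe_at_start else frag_end
    # Only cleavage inside the restriction fragment shortens the detected band.
    sub_bands = sorted(
        abs(site - anchor)
        for site in cleavage_sites
        if frag_start < site < frag_end
    )
    return full_length, sub_bands

# Hypothetical 5,000 bp fragment with cleavage at 1,200 and 3,500;
# the site at 7,000 lies outside the fragment and produces no band.
print(end_label_bands(0, 5000, [1200, 3500, 7000]))
print(end_label_bands(0, 5000, [1200, 3500, 7000], probe_at_start=False))
```

Reading the calculation in reverse, a sub-band of a given size places the border of an accessible region at that distance from the probed end of the restriction fragment.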

2. Release of Sequences Enriched in CpG Islands

The dinucleotide CpG is severely underrepresented in mammalian genomes relative to its expected statistical occurrence frequency of 6.25%. In addition, the bulk of CpG residues in the genome are methylated (with the modification occurring at the 5-position of the cytosine base). As a consequence of these two phenomena, total human genomic DNA is remarkably resistant to, for example, the restriction endonuclease Hpa II, whose recognition sequence is CCGG, and whose activity is blocked by methylation of the second cytosine in the target site.
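The 6.25% figure follows from assuming that the four bases occur with equal frequency and independently, so that any given dinucleotide is expected at (1/4) × (1/4) = 1/16 of positions. The following minimal sketch computes that expectation together with the observed CpG frequency of a sequence and a composition-corrected observed/expected ratio. The function names are illustrative, and the ratio formula (CpG count × length / (C count × G count)) is a commonly used convention that is not taken from the present disclosure.

```python
def uniform_expected_dinucleotide_frequency():
    """Expected frequency of any one dinucleotide for equal, independent bases."""
    return (1 / 4) ** 2

def observed_cpg_frequency(seq):
    """Fraction of dinucleotide positions in seq that are CpG."""
    seq = seq.upper()
    positions = len(seq) - 1
    if positions < 1:
        return 0.0
    hits = sum(1 for i in range(positions) if seq[i:i + 2] == "CG")
    return hits / positions

def cpg_observed_over_expected(seq):
    """CpG count normalized by the C and G content of seq (0.0 if undefined)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    cpg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return cpg * len(seq) / (c * g)

print(uniform_expected_dinucleotide_frequency())  # 0.0625, i.e., 6.25%
print(observed_cpg_frequency("CGCGCG"))           # toy CpG-rich sequence
print(cpg_observed_over_expected("CGCGCG"))
```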

An important exception to the overall paucity of demethylated Hpa II sites in the genome is provided by exceptionally CpG-rich sequences (so-called “CpG islands”) that occur in the vicinity of transcriptional start sites, and which are demethylated in the promoters of active genes. Jones et al. (1999) Nature Genet. 21:163-167. Aberrant hypermethylation of such promoter-associated CpG islands is a well-established characteristic of the genome of malignant cells. Robertson et al. (2000) Carcinogenesis 21:461-467.

Accordingly, another option for generating accessible regions relies on the observation that, whereas most CpG dinucleotides in the eukaryotic genome are methylated at the C5 position of the C residue, CpG dinucleotides within the CpG islands of active genes are unmethylated. See, for example, Bird (1992) Cell 70:5-8; and Robertson et al. (2000) Carcinogenesis 21:461-467. Indeed, methylation of CpG is one mechanism by which eukaryotic gene expression is repressed. Accordingly, digestion of cellular DNA with a methylation-sensitive restriction enzyme (i.e., one that does not cleave methylated DNA), especially one with the dinucleotide CpG in its recognition sequence, such as, for example, Hpa II, generates small fragments from unmethylated CpG island DNA. For example, upon the complete digestion of genomic DNA with Hpa II, the overwhelming majority of DNA will remain >3 kb in size, whereas the only DNA fragments of approximately 100-200 bp will be derived from demethylated, CpG-rich sequences, i.e., the CpG islands of active genes. Such small fragments are enriched in regulatory regions that are active in the cell from which the DNA was derived. They can be purified by differential solubility or size selection, for example, cloned to generate a library, and their nucleotide sequences determined and placed in one or more databases. Arrays comprising such sequences can be constructed.
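The size-based enrichment described above can be illustrated by an in-silico digest. The sketch below assumes that Hpa II cleaves after the first C of each unmethylated CCGG site, represents methylation as a set of blocked site positions, and uses a synthetic sequence; the function names and the example are hypothetical and are not part of the disclosed methods.

```python
import re

def hpa2_fragment_lengths(seq, methylated_sites=frozenset()):
    """Lengths of fragments after cutting C^CGG at every unmethylated site.

    methylated_sites holds the 0-based start positions of blocked CCGG sites.
    """
    cuts = [
        m.start() + 1
        for m in re.finditer("CCGG", seq.upper())
        if m.start() not in methylated_sites
    ]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

def size_select(lengths, low=100, high=200):
    """Keep fragment lengths within the inclusive size window."""
    return [n for n in lengths if low <= n <= high]

# Synthetic locus: two CCGG sites 154 bp apart, flanked by 3 kb of site-free DNA.
locus = "A" * 3000 + "CCGG" + "T" * 150 + "CCGG" + "A" * 3000

# Both sites unmethylated (modeling an active CpG island): a small fragment is released.
print(size_select(hpa2_fragment_lengths(locus)))
# First site methylated: no fragment falls in the 100-200 bp window.
print(size_select(hpa2_fragment_lengths(locus, methylated_sites={3000})))
```

In this toy example a fragment in the 100-200 bp window appears only when both neighboring sites are unmethylated, which mirrors the rationale for size selection given above.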

Digestion with methylation-sensitive enzymes, optionally in the presence of one or more additional nucleases, can be conducted in whole cells, in isolated nuclei, with bulk chromatin or with naked DNA obtained after stripping proteins from chromatin. In all instances, relatively small fragments are excised and these can be separated from the bulk chromatin or the longer DNA fragments corresponding to regions containing methylated CpG dinucleotides. The small fragments including unmethylated CpG islands can be isolated from the larger fragments using various size-based purification techniques (e.g., gel electrophoresis, sedimentation and size-exclusion columns) or differential solubility (e.g., polyethyleneimine, spermine, spermidine), for example.

As indicated above, a variety of methylation-sensitive restriction enzymes are commercially available, including, but not limited to, DpnII, MboI, HpaII and ClaI. Each of the foregoing is available from commercial suppliers such as, for example, New England BioLabs, Inc., Beverly, Mass.

In another embodiment, enrichment of regulatory sequences is accomplished by digestion of deproteinized genomic DNA with agents that selectively cleave AT-rich DNA. Examples of such agents include, but are not limited to, restriction enzymes having recognition sequences consisting solely of A and T residues, and single strand-specific nucleases, such as S1 and mung bean nuclease, used at elevated temperatures. Examples of suitable restriction enzymes include, but are not limited to, Mse I, Tsp509 I, Ase I, Dra I, Pac I, Psi I, Ssp I and Swa I. Such enzymes are available commercially, for example, from New England Biolabs, Beverly, Mass. Because of the concentration of GC-rich sequences within CpG islands (see, above), large fragments resulting from such digestion generally comprise CpG island regulatory sequences, especially when a restriction enzyme with a four-nucleotide recognition sequence consisting entirely of A and T residues (e.g., Mse I, Tsp509 I), is used as a digestion agent. Such large fragments can be separated, based on their size, from the smaller fragments generated from cleavage at regions rich in AT sequences. In certain cases, digestion with multiple enzymes recognizing AT-rich sequences provides greater enrichment for regulatory sequences.
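Why digestion at AT-only recognition sequences leaves GC-rich regions in large fragments can be seen in a minimal in-silico sketch. It uses the four-nucleotide recognition sequences of Mse I (TTAA) and Tsp509 I (AATT); for simplicity it places each cut at the start of the recognition sequence, which is an assumption of the sketch rather than a statement of the enzymes' exact cleavage positions. The function names and the synthetic sequence are illustrative.

```python
import re

# Recognition sequences of two AT-only four-cutters named in the text.
AT_SITES = ("TTAA", "AATT")  # Mse I, Tsp509 I

def at_cutter_digest(seq, sites=AT_SITES):
    """Cut seq at the start of every occurrence of each site; return fragments."""
    upper = seq.upper()
    cuts = set()
    for site in sites:
        cuts.update(m.start() for m in re.finditer(site, upper))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

def large_fragments(seq, min_len=100, sites=AT_SITES):
    """Fragments of at least min_len nucleotides after the double digest."""
    return [f for f in at_cutter_digest(seq, sites) if len(f) >= min_len]

# Synthetic locus: a 400 bp GC-only block flanked by AT-rich DNA.
locus = "TTAA" * 10 + "GC" * 200 + "AATT" * 10
kept = large_fragments(locus)
print(len(kept), [len(f) for f in kept])
```

With this synthetic input the AT-rich flanks are reduced to very short pieces while the GC-only block survives as a single large fragment, which is the basis for the size-based separation described above.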

Alternatively, or in addition to a size selection, large, CpG island-containing fragments generated by these methods can be subjected to an affinity selection to separate methylated from unmethylated large fragments. Separation can be achieved, for example, by selective binding to a protein containing a methylated DNA binding domain (Hendrich et al. (1998) Mol. Cell. Biol. 18:6538-6547; Bird et al. (1999) Cell 99:451-454) and/or to antibodies to methylated cytosine. Unmethylated large fragments are likely to comprise regulatory sequences involved in gene activation in the cell from which the DNA was derived. As with other embodiments, polynucleotides obtained by the aforementioned methods can be cloned to generate a library of regulatory sequences and/or the regulatory sequences can be immobilized on an array.

Regardless of the particular strategy employed to purify the unmethylated CpG islands from other fragments, the isolated fragments can be cloned to generate a library of regulatory sequences. The nucleotide sequences of the members of the library can be determined, optionally placed in one or more databases, and compared to a genome database to map these regulatory regions on the genome.

D. Immunoprecipitation

In other methods for identification and isolation of regulatory regions, enrichment of regulatory DNA sequences takes advantage of the fact that the chromatin of actively transcribed genes generally comprises acetylated histones. See, for example, Wolffe et al. (1996) Cell 84:817-819. In particular, acetylated H3 and H4 are enriched in the chromatin of transcribed genes, and chromatin comprising regulatory sequences is selectively enriched in acetylated H3. Accordingly, chromatin immunoprecipitation using antibodies to acetylated histones, particularly acetylated H3, can be used to obtain collections of sequences enriched in regulatory DNA.

Such methods generally involve fragmenting chromatin and then contacting the fragments with an antibody that specifically recognizes and binds to acetylated histones, particularly H3. The polynucleotides can subsequently be collected from the immunoprecipitate. Prior to fragmenting the chromatin, one can optionally crosslink the acetylated histones to adjacent DNA. Crosslinking of histones to the DNA within the chromatin can be accomplished according to various methods. One approach is to expose the chromatin to ultraviolet irradiation. Gilmour et al. (1984) Proc. Natl. Acad. Sci. USA 81:4275-4279. Other approaches utilize chemical crosslinking agents. Suitable chemical crosslinking agents include, but are not limited to, formaldehyde and psoralen. Solomon et al. (1985) Proc. Natl. Acad. Sci. USA 82:6470-6474; Solomon et al. (1988) Cell 53:937-947.

Fragmentation can be accomplished using established methods for fragmenting chromatin, including, for example, sonication, shearing and/or the use of restriction enzymes. The resulting fragments can vary in size, but using certain sonication techniques, fragments of approximately 200-400 nucleotide pairs are obtained.

Antibodies that can be used in the methods are commercially available from various sources. Examples of such antibodies include, but are not limited to, Anti Acetylated Histone H3, available from Upstate Biotechnology, Lake Placid, N.Y.

Additional chromatin modifications of a regulatory nature that can be identified with antibodies include, but are not limited to: global acetylation, lysine 5 acetylation, lysine 7 acetylation and lysine 9 acetylation of histone H2A; global acetylation, lysine 5 acetylation, lysine 12 acetylation, lysine 15 acetylation, lysine 16 acetylation, lysine 20 acetylation and serine 14 phosphorylation of histone H2B; global acetylation, lysine 4 methylation, lysine 9 methylation, lysine 9 trimethylation, lysine 9 acetylation, serine 10 phosphorylation, lysine 14 acetylation, arginine 26 methylation and lysine 28 methylation of histone H3; and global acetylation, lysine 8 acetylation, lysine 12 acetylation, lysine 16 acetylation and lysine 20 methylation of histone H4. Antibodies can be obtained, for example, from Abcam or Upstate Biotechnology and can comprise panels of distinct sera that distinguish among monomethylated, dimethylated and trimethylated lysine.

Identification of a binding site for a particular defined transcriptionfactor in cellular chromatin is indicative of the presence of regulatorysequences. This can be accomplished, for example, using the technique ofclromatin imniunoprecipitation. Briefly, this technique involves the useof a specific antibody to immunoprecipitate chromatin complexescomprising the corresponding antigen (in this case, the transcriptionfactor of interest), and examination of nucleotide sequences, present inthe immunoprecipitate, that are crosslinked to the antigen.Immunoprecipitation of a particular sequence by the antibody isindicative of interaction of the antigen with that sequence. See, forexample, O'Neill et al. in Methods in Enzymology, Vol. 274, AcademicPress, San Diego, 1999, pp. 189-197; Kuo et al. (1999) Method19:425-433; and Current Protocols in Molecular Biology, F. M. Ausubel etal., eds., Current Protocols, Chapter 21, a joint venture between GreenePublishing Associates, Inc. and John Wiley & Sons, Inc., (1998Supplement). After reversal of crosslinks, the released sequences can becloned, sequenced and/or placed on an array.

As with the other methods, polynucleotides isolated from an immunoprecipitate, as described herein, can be cloned to generate a library and/or sequenced, and/or the sequences can be placed on a nucleic acid array as described in greater detail below. Sequences adjacent to those detected by this method are also likely to be regulatory sequences. These can be identified by mapping the isolated sequences on the genome sequence for the organism from which the chromatin sample was obtained, and optionally entered into one or more databases.

E. Mapping DNase Hypersensitive Sites Relative to a Gene of Interest

A rapid method for mapping DNase hypersensitive sites (which can correspond to boundaries of accessible regions) with respect to a particular gene involves ligation of an adapter oligonucleotide to the DNA ends generated by DNase action, followed by amplification using an adapter-specific primer and a gene-specific primer. For this procedure, nuclei or isolated cellular chromatin are treated with a nuclease such as, for example, DNase I or micrococcal nuclease, and the chromatin-associated DNA is then purified. Purified, nuclease-treated DNA is optionally treated so as to generate blunt ends at the sites of nuclease action by, for example, incubation with T4 DNA Polymerase and the four deoxyribonucleoside triphosphates. After this treatment, a partially double-stranded adapter oligonucleotide is ligated to the DNA ends. The adapter contains a 5′-hydroxyl group at its blunt end and a 5′-extension, terminated with a 5′-phosphate, at the other end. The 5′-extension is an integral number of nucleotides greater than one nucleotide, preferably greater than 5 nucleotides, preferably greater than 10 nucleotides, more preferably 14 nucleotides or greater. Alternatively, a 5′-extension need not be present, as long as one of the 5′ ends of the adapter is unphosphorylated. This procedure generates a population of DNA molecules whose termini are defined by sites of nuclease action, with the aforementioned adapter ligated to those termini.

The DNA is then purified and subjected to amplification (e.g., PCR). One of the primers corresponds to the longer, 5′-phosphorylated strand of the adapter, and the other is complementary to a known site in the gene of interest or its vicinity. Amplification products are analyzed by, for example, gel electrophoresis. The size of the amplification product(s) indicates the distance between the site that is complementary to the gene-specific primer and the proximal border of an accessible region (in this case, a nuclease hypersensitive site). In additional embodiments, a plurality of second primers, each complementary to a segment of a different gene of interest, is used, to generate a plurality of amplification products.
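By way of illustration only, the relationship between amplification product size and the position of a hypersensitive site can be sketched as follows. The function name, the coordinate convention and the `adapter_len` parameter (the number of adapter-derived base pairs carried in the product) are assumptions introduced for this sketch and are not part of the disclosed method.

```python
def hypersensitive_site_position(primer_pos, product_len, adapter_len, primer_strand="-"):
    """Estimate the genomic coordinate of a nuclease cut from a PCR product size.

    primer_pos    -- genomic coordinate of the 5' end of the gene-specific primer
    product_len   -- size of the amplification product (bp), e.g. read from a gel
    adapter_len   -- adapter-derived base pairs in the product (assumed known)
    primer_strand -- "-" if the primer extends toward lower coordinates,
                     "+" if it extends toward higher coordinates
    """
    # The genomic portion of the product spans from the primer to the cut site.
    genomic_len = product_len - adapter_len
    if genomic_len <= 0:
        raise ValueError("product must be longer than the adapter portion")
    if primer_strand == "-":
        return primer_pos - genomic_len
    return primer_pos + genomic_len


# Example: a 425 bp product, 25 bp of which is adapter, from a primer at 10,000.
print(hypersensitive_site_position(10_000, 425, 25, "-"))
```

Gel-based sizing is approximate, so an estimate of this kind would in practice carry the resolution of the electrophoresis step.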

In additional embodiments, nucleotide sequence determination can be conducted during the amplification. Such sequence analyses can be conducted individually or in multiplex fashion.

While the foregoing discussion on mapping has referred primarily to certain nucleases, it will be clear to those skilled in the art that any enzymatic or chemical agent, or combination thereof, capable of cleavage in an accessible region, can be used in the mapping methods just described.

F. Footprinting

Yet another method for identifying regulatory regions in cellular chromatin is by in vivo footprinting, a technique in which the accessibility of particular nucleotides (in a region of interest) to enzymatic or chemical probes is determined. Differences in accessibility of particular nucleotides to a probe, in different cell types, can indicate binding of a transcription factor to a site encompassing those nucleotides in one of the cell types being compared. The site can be isolated, if desired, by standard recombinant methods. See Wassarman and Wolffe (eds.) Methods in Enzymology, Volume 304, Academic Press, San Diego, 1999.

G. In Vitro v. In Vivo Methods

Certain methods can optionally be performed in vitro or in vivo. For instance, treatment of cellular chromatin with chemical or enzymatic probes can be accomplished using isolated chromatin derived from a cell, and contacting the isolated chromatin with the probe in vitro. Methods that depend on methylation status can, if desired, be performed in vitro using naked genomic DNA. Alternatively, isolated nuclei can be contacted with a probe in vivo. In certain other in vivo methods, a probe can be introduced into living cells. Cells are permeable to some probes. For other probes, such as proteins, various methods, known to those of skill in the art, exist for introduction of macromolecules into cells. Alternatively, a nucleic acid encoding an enzymatic probe, optionally in a vector, can be introduced into cells by established methods, such that the nucleic acid encodes an enzymatic probe that is active in the cell in vivo. Methods for the introduction of proteins and nucleic acids into cells are known to those of skill in the art and are disclosed, for example, in co-owned PCT publication WO 00/41566. Methods for methylating chromatin in vivo using recombinant constructs are described, for example, by Wines, et al. (1996) Chromosoma 104:332-340; Kladde, et al. (1996) EMBO J. 15: 6290-6300, and van Steensel, B. and Henikoff, S. (2000) Nature Biotechnology 18:424-428, each of which is incorporated by reference in its entirety. It is also possible to introduce constructs into a cell to express a protein that cleaves the DNA such as, for example, a nuclease or a restriction enzyme. See, for example, U.S. Pat. No. 5,792,640.

H. Deproteinization

As described above in the various isolation schemes, with certain methods it is desirable or necessary to deproteinize the chromatin or chromatin fragments. This can be accomplished utilizing established methods that are known to those of skill in the art such as, for example, phenol extraction. Various kits and reagents for isolation of genomic DNA can also be used and are available commercially, for example, those provided by Qiagen (Valencia, Calif.).

I. Hypersensitive Site Mapping to Confirm Identification of Accessible Regions

As disclosed herein, accessible regions can be identified by any number of methods. Collections of accessible region sequences from a particular cell can be cloned to generate a library, polynucleotides from the library, or portions or complements thereof, can be placed on an array, and the nucleotide sequences of the members of the library can be determined to generate a database specific to the cell from which the accessible regions were obtained. Confirmation of the identification of a cloned insert in a library as comprising an accessible region is accomplished, if desired, by mapping the cloned sequence on the genome and conducting DNase hypersensitive site mapping on cellular chromatin in the vicinity of the mapped cloned sequence. Co-localization of a particular cloned sequence with a DNase hypersensitive site validates the identity of the insert as an accessible regulatory region. Once a suitable number of distinct inserts are confirmed to reside within DNase hypersensitive sites in vivo, larger-scale sequencing and annotation projects can be initiated. For example, a large number of library inserts can be sequenced and their map locations determined by comparison with genome sequence databases. For a given accessible region sequence, the closest ORF (open reading frame) in the genome is provisionally assigned as the target locus regulated by sequences within the accessible region. In this way, a large number of ORFs in the genome acquire one or more potential regulatory domains, the function of which can be confirmed by standard procedures.
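The provisional assignment of an accessible region to its closest ORF can be sketched, purely as an illustration, as a nearest-interval lookup. The tuple layout, the use of a single midpoint coordinate for the accessible region and the gene names are hypothetical choices made for this sketch; chromosome and strand handling are omitted.

```python
def nearest_orf(region_mid, orfs):
    """Return the name of the ORF closest to an accessible region.

    region_mid -- midpoint coordinate of the mapped accessible region
    orfs       -- list of (name, start, end) tuples on the same chromosome
    """
    def distance(orf):
        _, start, end = orf
        if start <= region_mid <= end:
            return 0  # region lies inside the ORF
        return min(abs(region_mid - start), abs(region_mid - end))

    return min(orfs, key=distance)[0]


# Hypothetical ORF coordinates, for illustration only.
orfs = [("geneA", 1_000, 2_000), ("geneB", 5_000, 6_000)]
print(nearest_orf(4_200, orfs))
```

Such an assignment is provisional, as stated above; a distal element may regulate a gene other than its nearest neighbor, which is why functional confirmation by standard procedures follows.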

It will be apparent that certain of the methods described herein can be used in combination to provide confirmation and additional information. For example, treatment of nuclei or cellular chromatin with a probe can be followed by any or all of: isolation of libraries of accessible DNA sequences, mapping the sites of probe reactivity and attaching one or more accessible sequences from the library to an array. Arrays of regulatory sequences are useful in a number of methods, as described below.

IV. Libraries of Accessible Polynucleotides and Sequence Determination

A. Library Formation

The isolated accessible regions can be used to form libraries of accessible regions; generally the libraries correspond to regions that are accessible for a particular cell. As used herein, the term “library” refers to a pool of DNA fragments that have been propagated in some type of a cloning vector. The libraries of regulatory domains will typically contain a single accessible DNA fragment per clone.

Accessible regions isolated by methods disclosed herein can be cloned into any known vector according to established methods. In general, isolated DNA fragments are optionally cleaved, tailored (e.g., made blunt-ended or subjected to addition of oligonucleotide adapters) and then inserted into a desired vector by, for example, ligase- or topoisomerase-mediated enzymatic ligation or by chemical ligation. To confirm that the correct sequence has been inserted, the vectors can be analyzed by standard techniques such as restriction endonuclease digestion and nucleotide sequence determination.

Additional cloning and in vitro amplification methods suitable for the construction of recombinant nucleic acids are well known to persons of skill in the art. Examples of these techniques and instructions sufficient to direct persons of skill through many cloning techniques are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology, Volume 152, Academic Press, Inc., San Diego, Calif. (Berger); Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols in Molecular Biology, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (1987 and periodic updates) (Ausubel); and Sambrook, et al. (2001) Molecular Cloning: A Laboratory Manual, 3rd ed., each of which is incorporated by reference in its entirety.

A variety of common vector backbones are well known in the art. For cloning in bacteria, common vectors include pBR322 and vectors derived therefrom, such as pBLUESCRIPT™, the pUC series of plasmids, as well as λ-phage derived vectors. In yeast, vectors that can be used include Yeast Integrating plasmids (e.g., YIp5) and Yeast Replicating plasmids (the YRp series plasmids), the pYES series and pGPD-2 for example. Expression in mammalian cells can be achieved, for example, using a variety of commonly available plasmids, including pSV2, pBC12BI, and p91023, the pCDNA series, pCMV1, pMAMneo, as well as lytic virus vectors (e.g., vaccinia virus, adenovirus), episomal virus vectors (e.g., bovine papillomavirus), and retroviral vectors (e.g., murine retroviruses). Expression in insect cells can be achieved using a variety of baculovirus vectors, including pFastBac1, pFastBacHT series, pBlueBac4.5, pBlueBacHis series, pMelBac series, and pVL1392/1393, for example. Additional vectors and host cells are well known to those of skill in the art in view of the teachings herein.

The libraries formed thus represent regulatory regions from any cell type and/or subject, for example untransformed human cells and/or one or more cancer cell lines. Non-limiting examples of suitable cells from which to prepare DNA regulatory libraries described herein include primary foreskin fibroblasts (ATCC CRL-2522); white blood cells filtered from whole blood (Memorial Blood Centers of Minnesota); pooled placental cells (CHORI); skeletal myocytes (Clonetics); and MCF-7 cells, a breast carcinoma cell line (ATCC HTB-22). Any other cell type can be used, for example any of the cell types available from the ATCC.

Furthermore, because genome activity is cell-type specific, and because regulatory DNA activity correlates with that of the genome, a panel of regulatory DNA libraries from cell types from major embryonic lineages (e.g., ectoderm, endoderm, and mesoderm) can be generated. Male and/or female cells are used, depending on the application, although male cells may be preferred in certain instances to ensure inclusion of Y-chromosome specific regulatory DNA.

In addition, the regulatory sequence in each clone can be virtually any length, and is preferably between about 25 bp and about 1,000 bp in length (or any value therebetween), more preferably between about 50 and about 500 bp in length (or any value therebetween), or between about 100 and 300 bp in length (or any value therebetween). As noted above, regulatory sequences can be isolated from any cell type.

The size (number of clones) of each library may vary, for example from several hundred to a hundred thousand or more members (clones). For example, the regulatory DNA library prepared from HEK 293 cells described in the Examples included approximately 40,000 different clones.

Alternatively or in addition, such individual libraries can be combined to form a collection of libraries. Essentially any number of libraries can be combined. Typically, a collection of libraries contains at least 2, 5 or 10 libraries, each library corresponding to a different type of cell or a different cellular state. For example, a collection of libraries can comprise a library from cells infected with one or more pathogenic agents and a library from counterpart uninfected cells. Determination of the nucleotide sequences of the members of a library can be used to generate a database of accessible sequences specific to a particular cell type.

In a separate embodiment, subtractive hybridization and/or difference analysis techniques can be used in the analysis of two or more collections of accessible sequences, obtained by any of the methods disclosed herein, to isolate sequences that are unique to one or more of the collections. For example, accessible sequences from normal cells can be subtracted from accessible sequences present in virus-infected cells to obtain a collection of accessible sequences unique to the virus-infected cells. Conversely, accessible sequences from virus-infected cells can be subtracted from accessible sequences present in uninfected cells to obtain a collection of sequences that become inaccessible in virus-infected cells. Such unique sequences obtained by subtraction can be used to generate libraries and/or databases. Methods for subtractive hybridization and difference analysis are known to those of skill in the art and are disclosed, for example, in U.S. Pat. Nos. 5,436,142; 5,501,964; 5,525,471 and 5,958,738.
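Where the collections being compared have already been sequenced and entered into databases, the logic of the two subtractions described above can be mirrored computationally. The following is only an in silico analogy to subtractive hybridization, not the laboratory procedure itself; the region identifiers are hypothetical, and exact-identifier matching stands in for the sequence or map-position comparison that a real analysis would require.

```python
def subtract(collection, subtrahend):
    """Return the accessible regions present in `collection` but not in `subtrahend`."""
    return set(collection) - set(subtrahend)


# Hypothetical identifiers of accessible regions in each cell population.
uninfected = {"region_1", "region_2", "region_3"}
infected = {"region_2", "region_3", "region_4"}

# Regions that become accessible upon infection.
gained = subtract(infected, uninfected)
# Regions that become inaccessible upon infection.
lost = subtract(uninfected, infected)

print(sorted(gained), sorted(lost))
```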

Analysis (e.g., nucleotide sequence determination) of libraries of accessible region sequences can be facilitated by concatenating a series of such sequences with interposed marker sequences, using methods similar to those described in U.S. Pat. Nos. 5,695,937 and 5,866,330.

B. High-Throughput Library Construction

Rapid, high-throughput construction of libraries of accessible regions can be achieved using a combination of nuclease digestion and ligation-mediated PCR. See, for example, Pfeifer et al. (1993) Meth. In Mol. Biol. 15:153-168; Mueller et al. (1994) In: Current Protocols in Molecular Biology, ed. F. M. Ausubel et al., John Wiley & Sons, Inc., vol. 2, pp. 15.5.1-15.5.26. Nuclei or isolated cellular chromatin are subjected to the action of one or more nucleases such as, for example, a restriction enzyme, DNase I and/or micrococcal nuclease, and the digested DNA is purified and end-repaired using, for example, T4 DNA polymerase and the four deoxyribonucleoside triphosphates. A ligation reaction is conducted using, as substrates, the nuclease-digested, end-repaired chromosomal DNA and a double-stranded adapter oligonucleotide. The adapter has one blunt end, containing a 5′-phosphate group, which is ligated to the ends generated by nuclease action. The other end of the adapter oligonucleotide has a 3′ extension and is not phosphorylated (and therefore is not capable of being ligated to another DNA molecule). In one embodiment, this extension is two bases long and has the sequence TT, although any size extension of any sequence can be used.

Adapter-ligated DNA is digested with a restriction enzyme that generates a blunt end. Preferably, the restriction enzyme has a four-nucleotide recognition sequence. Examples include, but are not limited to, Rsa I, Hae III, Alu I, Bst UI, and Cac8I. Alternatively, DNA can be digested with a restriction enzyme that does not generate blunt ends, and the digested DNA can optionally be treated so as to produce blunt ends by, for example, exposure to T4 DNA Polymerase and the four deoxynucleoside triphosphates.

Next, a primer extension reaction is conducted, using Taq DNA polymerase and a primer complementary to the adapter. The product of the extension reaction is a double-stranded DNA molecule having the following structure: adapter sequence/nuclease-generated end/internal sequence/restriction enzyme-generated end/3′-terminal A extension. The 3′-terminal A extension results from the terminal transferase activity of the Taq DNA Polymerase used in the primer extension reaction.

The end containing the 3′-terminal A extension (i.e., the end originally generated by restriction enzyme digestion after ligation of the adapter) is joined, by DNA topoisomerase, to a second double-stranded adapter oligonucleotide containing a 3′-terminal T extension. In one embodiment, prior to joining, the adapter oligonucleotide is covalently linked, through the 3′-phosphate of the overhanging T residue, to a molecule of DNA topoisomerase. See, for example, U.S. Pat. No. 5,766,891. This results in the production of a molecule containing a first adapter joined to the nuclease-generated end and a second adapter joined to the restriction enzyme-generated end. This molecule is then amplified using primers complementary to the first and second adapter sequences. Amplification products are cloned to generate a library of accessible regions and the sequences of the inserts can be determined to generate a database. The accessible regions can be placed on an array.

In the practice of the aforementioned method, it is possible to obtain DNA fragments in which both ends of the fragment have resulted from nuclease cleavage (N-N fragments). These fragments will contain both the first and second adapters on each end, with the first adapter internal to the second. Any given fragment of this type will theoretically yield four amplification products which, in sum, will be amplified twice as efficiently as a fragment having one nuclease-generated end and one restriction enzyme-generated end (N-R fragments). Thus, the final population of amplified material will comprise both N-N fragments and N-R fragments. Amplification using only one of the two primers will yield a population of amplified molecules that is enriched for N-N fragments (which will, under these conditions, be amplified exponentially, while N-R fragments will be amplified in a linear fashion). A population of amplification products enriched in N-R fragments can be obtained by subtracting the N-N population from the total population of amplification products. Methods for subtraction and subtractive hybridization are known to those of skill in the art. See, for example, U.S. Pat. Nos. 5,436,142; 5,501,964; 5,525,471 and 5,958,738.
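The contrast between exponential and linear amplification under single-primer conditions can be made concrete with an idealized calculation. The model below assumes perfect efficiency in every cycle, which real reactions do not achieve; the function name and the cycle number are illustrative assumptions only.

```python
def copies_after_pcr(initial, cycles, exponential):
    """Idealized copy number after a given number of amplification cycles.

    exponential=True  -- primer sites at both ends: each cycle doubles the pool
    exponential=False -- primer site at one end only: each cycle adds one
                         new strand per original template
    """
    if exponential:
        return initial * 2 ** cycles
    return initial * (1 + cycles)


cycles = 20
nn = copies_after_pcr(1, cycles, exponential=True)   # fragment with primer sites at both ends
nr = copies_after_pcr(1, cycles, exponential=False)  # fragment with a primer site at one end
print(nn, nr, round(nn / nr))
```

Even under these simplified assumptions, the exponentially amplified species outnumbers the linearly amplified species by several orders of magnitude after 20 cycles, which is the basis for the enrichment described above.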

In another embodiment, cellular chromatin is subjected to limited nuclease action, and fragments having one end defined by nuclease cleavage are preferentially cloned. For example, isolated chromatin or permeabilized nuclei are exposed to low concentrations of a nuclease (e.g., DNase I or a restriction enzyme), optionally for short periods of time (e.g., one minute) and/or at reduced temperature (e.g., lower than 37° C.). Nuclease-treated chromatin is then deproteinized and the resulting DNA is digested to completion with a restriction enzyme, preferably one having a four-nucleotide recognition sequence. Any or all of the steps of nuclease treatment, deproteinization and restriction enzyme digestion are optionally conducted on DNA that has been embedded in agarose, to prevent shearing which would generate artifactual ends.

Preferential cloning of nuclease-generated fragments is accomplished by a number of methods. For example, prior to restriction enzyme digestion, nuclease-generated ends can be rendered blunt-ended by appropriate nuclease and/or polymerase treatment (e.g., T4 DNA polymerase plus the 4 dNTPs). Following restriction digestion, fragments are cloned into a vector that has been cleaved to generate a blunt end and an end that is compatible with that produced by the restriction enzyme used to digest the nuclease-treated chromatin. For example, if Sau 3AI is used for digestion of nuclease-treated chromatin, the vector can be digested with Bam HI (which generates a cohesive end compatible with that generated by Sau 3AI) and Eco RV or Sma I (either of which generates a blunt end).

Ligation of adapter oligonucleotides, to nuclease-generated ends and/or restriction enzyme-generated ends, can also be used to assist in the preferential cloning of fragments containing a nuclease-generated end. For example, a library of accessible sequences is obtained by selective cloning of fragments having one blunt end (corresponding to a site of nuclease action in an accessible region) and one cohesive end, as follows. Nuclease-treated chromatin is digested with a first restriction enzyme that produces a single-stranded extension to generate a population of fragments, some of which have one nuclease-generated end and one restriction enzyme-generated end and others of which have two restriction enzyme-generated ends. If this collection of fragments is ligated to a vector that has been digested with the first restriction enzyme (or with an enzyme that generates cohesive termini that are compatible with those generated by the first restriction enzyme), fragments having two restriction enzyme-generated ends will generate circular molecules, while fragments having a restriction enzyme-generated end and a nuclease-generated end will only ligate at the restriction enzyme-generated end, to generate linear molecules slightly longer than the vector. Isolation of these linear molecules (from the circular molecules) provides a population of sequences having one end generated by nuclease action, which thereby correspond to accessible sequences. Separation of linear DNA molecules from circular DNA molecules can be achieved by methods well known in the art, including, for example, gel electrophoresis, equilibrium density gradient sedimentation, velocity sedimentation, phase partitioning and selective precipitation. The isolated linear molecules are then rendered blunt ended by, for example, treatment with a DNA polymerase (e.g., T4 DNA polymerase, E. coli DNA polymerase I Klenow fragment) optionally in the presence of nucleoside triphosphates, and recircularized by ligation to generate a library of accessible sequences.

An alternative embodiment for selective cloning of fragments having one nuclease-generated end and one restriction enzyme-generated end is as follows. After restriction enzyme digestion of nuclease-treated chromatin, protruding restriction enzyme-generated ends are “capped” by ligating, to the fragment population, an adapter oligonucleotide containing a blunt end and a cohesive end that is compatible with the end generated by the restriction enzyme, which reconstitutes the recognition sequence. The fragment population is then subjected to conditions that convert protruding ends to blunt ends such as, for example, treatment with a DNA polymerase in the presence of nucleoside triphosphates. This step converts nuclease-generated ends to blunt ends. The fragments are then re-cleaved with the restriction enzyme to regenerate protruding ends on those ends that were originally generated by the restriction enzyme. This results in the production of two populations of fragments. The first (desired) population comprises fragments having one nuclease-generated blunt end and one restriction enzyme-generated protruding end; these fragments are derived from accessible regions of cellular chromatin. The second population comprises fragments having two restriction enzyme-generated protruding ends. Ligation into a vector containing one blunt end and one end compatible with the restriction enzyme-generated protruding end results in cloning of the desired fragment population to generate a library of accessible sequences.

An additional exemplary method for selecting against cloning of fragments having two restriction enzyme-generated ends involves ligation of nuclease-treated, restriction enzyme-digested DNA to a linearized vector whose ends are compatible only with the ends generated by the restriction enzyme. For example, if Sau 3AI is used for restriction digestion, a Bam HI-digested vector can be used. In this case, fragments having two Sau 3AI ends will be inserted into the vector, causing recircularization of the linear vector. For fragments having a nuclease-generated end and a restriction enzyme-generated end, only the restriction enzyme-generated end will be ligated to the vector; thus the ligation product will remain a linear molecule. In certain embodiments, E. coli DNA ligase is used, since this enzyme ligates cohesive-ended molecules at a much higher efficiency than blunt-ended molecules. Separation of linear from circular molecules, and recovery of the linear molecules, generates a population of molecules enriched in the desired fragments. Such separation can be achieved, for example, by gel electrophoresis, dextran/PEG partitioning and/or spermine precipitation. Alberts (1967) Meth. Enzymology 12:566-581; Hoopes et al. (1981) Nucleic Acids Res. 9:5493-5504. End repair of the selected linear molecules, followed by recircularization, results in cloning of sequences adjacent to a site of nuclease action.

Size fractionation can also be used, separately or in connection with the other methods described above. For example, after restriction digestion, DNA is fractionated by gel electrophoresis, and small fragments (e.g., having a length between 50 and 1,000 nucleotide pairs) are selected for cloning.

In another embodiment, regulatory regions are preferentially cloned using the unique cohesive overhang characteristic of regulatory DNA that has been cleaved with a nuclease in chromatin (e.g., a CG overhang when Hpa II is used for cleavage). Nuclei or cellular chromatin are exposed to brief Hpa II digestion, and the chromatin is deproteinized and digested to completion with a secondary restriction enzyme, preferably one that has a four-nucleotide recognition sequence (e.g., Sau 3AI). Any or all of the steps of initial cleavage (e.g., by Hpa II), deproteinization and restriction enzyme digestion are optionally conducted on DNA that has been embedded in agarose, to prevent shearing that would generate artifactual ends. Fragments containing one Hpa II end and one end generated by the secondary restriction enzyme are preferentially cloned into an appropriately digested vector. For example, if the secondary restriction enzyme is Sau 3AI, the vector can be digested with Cla I (whose end is compatible with a Hpa II end) and Bam HI (whose end is compatible with that generated by Sau 3AI), thus leading to selective cloning of Hpa II/Sau 3AI regulatory DNA fragments.
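The end compatibilities relied upon in this and the preceding cloning schemes can be checked with a short sketch. The recognition sequences and cut positions in the table are the commonly documented values for these enzymes and are supplied here as background rather than taken from the present disclosure; the function names are hypothetical, and the overhang calculation as written applies only to palindromic sites that are cut to leave a 5′ overhang or a blunt end.

```python
# Recognition site (top strand, 5'->3') and the top-strand cut offset.
ENZYMES = {
    "Sau3AI": ("GATC", 0),    # ^GATC
    "BamHI": ("GGATCC", 1),   # G^GATCC
    "HpaII": ("CCGG", 1),     # C^CGG
    "ClaI": ("ATCGAT", 2),    # AT^CGAT
    "EcoRV": ("GATATC", 3),   # GAT^ATC (blunt)
    "SmaI": ("CCCGGG", 3),    # CCC^GGG (blunt)
}


def overhang(enzyme):
    """Return the 5' single-stranded extension left by the enzyme ('' if blunt)."""
    site, cut = ENZYMES[enzyme]
    # For a palindromic site the bottom strand is cut at the mirror position,
    # so the unpaired bases are those between the two cut positions.
    return site[cut:len(site) - cut]


def compatible(enzyme_a, enzyme_b):
    """Two ends can be joined if their overhangs are identical (or both blunt)."""
    return overhang(enzyme_a) == overhang(enzyme_b)


print(overhang("HpaII"), compatible("HpaII", "ClaI"), compatible("Sau3AI", "BamHI"))
```

Under these values, a vector cut with Cla I and Bam HI presents one CG and one GATC overhang, so it accepts Hpa II/Sau 3AI fragments in a single orientation.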

In certain embodiments, fragments of accessible DNA, obtained by any of the methods disclosed herein, can be ligated to an adapter containing a promoter (e.g., a T7 promoter, a T3 promoter or an SP6 promoter). Subsequently, the cloned regulatory DNA can be directly amplified and/or labeled for screening using the arrays described herein, using standard methods. Optionally, a biotinylated oligonucleotide adapter may be ligated to one end (e.g., the end obtained by initial cleavage in an accessible region) of a regulatory DNA fragment from a library, and the regulatory DNA precipitated using avidin. The strength of the biotin-avidin interaction allows for repeated, high-stringency washes to eliminate non-regulatory DNA from the preparations. Any known binding pair may also be used for this purpose. Similarly, the second end of the regulatory fragment (generated by the second nuclease) can be ligated using a second adapter specific to the end generated by the second nuclease. Regulatory fragments can then be amplified (e.g., by PCR) using primers specific for the two adapters. Thus, ligation of adapter oligonucleotides, as described herein, to nuclease-generated ends and/or to the ends generated by the secondary restriction enzyme, can also be used to assist in the preferential cloning of fragments.

Size fractionation can also be used, separately or in connection with the other methods described above. For example, after digestion with the secondary restriction enzyme, DNA is fractionated by gel electrophoresis, and small fragments (e.g., having a length between 50 and 1,000 nucleotide pairs) are selected for cloning.

C. Sequencing

Purified and/or amplified DNA fragments comprising accessible regions can be sequenced according to known methods. In some instances, the isolated polynucleotides are cloned into a vector that is introduced into a host to amplify the sequence, and the polynucleotide is then purified from the cells and sequenced. Depending upon sequence length, cloned sequences can be rapidly sequenced using commercial sequencers such as the Prism 377 DNA Sequencers available from Applied Biosystems, Inc., Foster City, Calif.

D. Analysis/Selection of Libraries

As noted above, various techniques can be used to evaluate the library and determine whether it will be used for further purposes such as to make an array. Non-limiting examples of analysis techniques include sequencing, evaluating the location of cloned fragments on the genome (e.g., in relation to DNaseI hypersites and/or genes), and/or evaluation of regulatory nature of the fragments (e.g., comparison to expression profiles, transcription factor site binding density, and/or conserved sequences relative to mouse genome). These methods may be used alone or in combination.

For example, any number of clones from any given library may be randomly selected and sequenced. Clones that fall within 500 bp of transcription start sites of known genes may be referred to as “promoter” clones based on their proximity to a transcription start site. The remaining (non-promoter) clones can be evaluated to determine the percentage of clones that co-localize with DNaseI hypersensitive sites, for example by randomly selecting non-promoter clones and mapping chromatin structure at each location by conventional indirect end-labeling. Libraries in which more than 10% of the randomly selected non-promoter clones are not derived from DNaseI hypersensitive sites are typically not selected for further manipulations and one or more additional libraries are prepared from the same cell type using different experimental conditions (e.g., lower restriction enzyme concentrations).
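The two numerical criteria in this evaluation (the 500 bp promoter window and the 10% ceiling on sampled non-promoter clones that fall outside hypersensitive sites) can be expressed as a minimal sketch. The function names and input layout are assumptions made for illustration; the hypersensitivity calls themselves would come from the indirect end-labeling experiments described above.

```python
def is_promoter_clone(clone_pos, tss_positions, window=500):
    """True if the clone lies within `window` bp of any known transcription start site."""
    return any(abs(clone_pos - tss) <= window for tss in tss_positions)


def library_passes(hs_calls, max_non_hs_fraction=0.10):
    """Apply the hypersensitive-site criterion to sampled non-promoter clones.

    hs_calls -- one boolean per sampled clone: True if the clone co-localizes
                with a DNaseI hypersensitive site
    """
    non_hs = sum(1 for call in hs_calls if not call)
    # The library is rejected when MORE than the allowed fraction is not HS-derived.
    return non_hs / len(hs_calls) <= max_non_hs_fraction


# Twenty sampled non-promoter clones, one of which is not in a hypersensitive site.
print(library_passes([True] * 19 + [False]))
```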

In addition, some or all clones in a library that lie within 10 kb of the transcription start site of known genes can be compared to the expression profile of the cell type used for regulatory DNA library preparation using any suitable technique, for example using Affymetrix equipment that allows expression-profiling from the same cells from which the regulatory DNA library is prepared.

Some or all clones (e.g., non-promoter clones) of a library can also be evaluated for transcription factor binding site density. Often, an average increase of at least 2-fold or 4-fold in the number of transcription factor binding sites per fragment, relative to bulk genomic DNA of identical GC composition, is obtained. Such evaluation can be conducted using any suitable techniques, for example, using publicly available databases such as TransFac. See, for example, Wingender et al. (1997, 2001).
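The fold increase referred to above is a ratio of average binding-site counts, which can be sketched as follows. The per-fragment counts are invented for illustration, and the step of predicting sites (e.g., against a binding-site database) and of constructing a GC-matched background is assumed to have been carried out separately.

```python
def fold_enrichment(library_counts, background_counts):
    """Ratio of mean binding sites per fragment: library vs. GC-matched background."""
    library_mean = sum(library_counts) / len(library_counts)
    background_mean = sum(background_counts) / len(background_counts)
    return library_mean / background_mean


# Hypothetical predicted-site counts per fragment.
library_counts = [4, 6, 8]
background_counts = [2, 3, 4]
print(fold_enrichment(library_counts, background_counts))
```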

Sequence conservation, for example with other mammalian genomes such as mouse, can also be used to help evaluate the suitability of a particular library. See, also, Pennacchio et al. (2001) Nat Rev Genet 2:100-109. Sequence analysis can be readily conducted using publicly available genome analysis tools. Sequence conservation analysis is rarely used alone to identify regulatory DNA, but does provide another tool for validating the regulatory nature of the experimentally obtained DNA fragments. One, though not the only, criterion for suitability of a library is that at least about 75% of those clones that fall in mouse-human syntenic regions reside in regions having a conservation score of >2.0, as defined by the UCSC Human/Mouse Evolutionary Conservation Score metric (FIG. 4).
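As a minimal sketch of the conservation criterion, the fraction of syntenic clones exceeding the score threshold can be compared against the 75% figure. The scores below are invented, and the function name is hypothetical; obtaining the scores from a genome browser or conservation track is assumed to happen upstream.

```python
def passes_conservation(scores, min_score=2.0, min_fraction=0.75):
    """True if enough syntenic clones lie in regions scoring above `min_score`.

    scores -- one conservation score per clone that maps to a
              mouse-human syntenic region
    """
    conserved = sum(1 for score in scores if score > min_score)
    return conserved / len(scores) >= min_fraction


# Hypothetical scores for four syntenic clones: three exceed 2.0.
print(passes_conservation([2.5, 3.0, 2.1, 1.0]))
```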

DNA libraries that meet the test criteria may then be sequenced. Preferably, sequencing is limited to the cloned DNA fragment (e.g., about 100-500 bp). Information gathered after the initial 1,000 clones in a library have been sequenced can be further analyzed computationally to estimate library depth. Libraries predicted to contain >10,000 unique clones may then be sequenced to completion (“completion” in this case is defined as fewer than 2% new clones identified per 100 sequence reads). Sequence information can be assembled into a database with LocusID-style identifiers designating each clone by cytological location and distance from the transcription start site of the nearest gene.
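The completion criterion stated above (fewer than 2% new clones per 100 sequence reads) can be sketched as follows. The function name and the representation of each read as a clone identifier are hypothetical; only the batch size and cutoff come from the text.

```python
def reads_until_completion(read_ids, batch_size=100, new_fraction_cutoff=0.02):
    """Scan sequence reads in consecutive batches and return the number of
    reads processed when a batch first yields fewer than 2% new clones.
    Returns None if no complete batch meets the criterion."""
    seen = set()
    for start in range(0, len(read_ids) - batch_size + 1, batch_size):
        new_clones = 0
        for clone_id in read_ids[start:start + batch_size]:
            if clone_id not in seen:
                seen.add(clone_id)
                new_clones += 1
        if new_clones / batch_size < new_fraction_cutoff:
            return start + batch_size
    return None
```

A trailing partial batch is ignored in this sketch, since the criterion is expressed per 100 reads.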

Libraries generated and sequenced from different cell types (e.g., skin, blood, muscle, placenta) may also be cross-referenced to evaluate the number of shared and unique clones. For example, the total number of unique clones in the compared libraries can be assessed as well as the number of clones unique to each cell-specific library. These analyses, performed using standard techniques as described herein, can be used to assess whether a sufficiently representative number of regulatory fragments are contained in the libraries. For instance, if the total number of unique clones in the combined libraries exceeds approximately 2 per gene, further sequencing may not be necessary and the library may be deemed to be sufficiently representative of regulatory sequences of that cell type.
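The cross-referencing described above amounts to set arithmetic over clone identifiers. The following sketch assumes each library is represented as a set of clone identifiers keyed by cell type; the function name and data layout are illustrative and are not part of the disclosure.

```python
def compare_libraries(libraries):
    """`libraries` maps a cell-type name to a set of clone identifiers.
    Returns the total number of distinct clones, the clones shared by all
    libraries, and the clones unique to each library."""
    all_clones = set().union(*libraries.values())
    shared = set(all_clones)
    for clones in libraries.values():
        shared &= clones
    unique_to = {}
    for name, clones in libraries.items():
        # Pool every other library, then subtract it from this one.
        others = set().union(*(c for n, c in libraries.items() if n != name))
        unique_to[name] = clones - others
    return len(all_clones), shared, unique_to
```

Dividing the returned total by the number of genes gives the clones-per-gene figure that the text compares against approximately 2.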

Libraries used to make arrays preferably include a sufficient number of clones to represent about 80% of all regulatory sequences in the genome under study. Given that a conservative estimate of the total number of regulatory DNA segments in the human genome is approximately 60,000 (i.e., about 2 per gene), the libraries described herein that are used to make arrays comprising human regulatory sequences preferably represent approximately 48,000 individual regulatory DNA regions, as determined using one or more of the techniques set forth herein. In addition, libraries used in construction of regulatory arrays typically include at least 10,000 clones that are located within about 1 kb of either side of a transcription start site as measured, for example, by comparison to the human transcriptome, as defined by UniGene.

E. Library Applications

As described in detail below, the regulatory DNA libraries described herein are used to facilitate production of arrays of regulatory DNAs. In addition, the libraries themselves may be used for various applications, for example to identify unique DNA sequences for targeting of regulatory DNA binding proteins.

For example, a collection of regulatory DNA sequences is analyzed, e.g., by a computer algorithm, and stretches of DNA unique to a particular regulatory region are identified. The identified sites represent potential target sites for binding by an engineered transcription factor. Engineered transcription factors, such as zinc finger proteins (ZFPs), can be used to regulate the expression of endogenous genes in cells and animals. Furthermore, engineered ZFPs can be designed to recognize any target sequence in DNA. See, e.g., U.S. Pat. Nos. 6,511,808; 6,503,717; 6,453,242; 6,534,261; 6,599,692; and 6,607,882. Preferably, the target sequence is between about 9-18 bp.

Sequences unique to a regulatory region, as described above, are identified by any suitable method, typically involving a number of steps. For example, genomic DNA surrounding the target gene may first be identified (e.g., using BLAST searching capabilities). A selected portion of the genome surrounding the target gene (approximately 20 kilobases) can then be compared to the complete set of regDNA sequences in order to identify the subset of regDNA regions that lie within the selected region. Once identified, these regDNA regions would each be parsed back against the entire regDNA database to find stretches of approximately 9-18 bp of unique sequence. The sequences identified as unique would be the preferred target sites for binding of a regulatory DNA binding protein. It should be noted that the DNA binding protein designed to recognize the unique target site may not recognize the entire unique sequence; for example, ZFPs that recognize 9 base pair sequences may be used in certain instances.
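The final step above, parsing a regDNA region against the entire regDNA database to find unique stretches, can be sketched as a k-mer count. This is one possible approach under stated assumptions, not the method of the disclosure: the function names are hypothetical, the candidate region is assumed to be a member of the database, only the forward strand is considered, and a fixed length k (here 9 bp, the lower end of the 9-18 bp range) is used.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_target_sites(candidate, database, k=9):
    """Return (offset, k-mer) pairs from `candidate` that occur exactly once
    across all sequences in `database` (which must include `candidate`)."""
    counts = Counter()
    for seq in database:
        counts.update(kmers(seq.upper(), k))
    return [
        (offset, kmer)
        for offset, kmer in enumerate(kmers(candidate.upper(), k))
        if counts[kmer] == 1
    ]
```

A fuller treatment would also count reverse-complement matches and would check candidate sites against the whole genome rather than the regDNA database alone.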

V. Arrays

Regulatory sequences present in libraries obtained as described above can be placed on an array or, alternatively, polynucleotide probes may be designed to represent the clones of the libraries and the probes then ordered into one or more arrays. Preferably, unique sequence signatures (e.g., “regDNA tags”) are used, probe sets for each regDNA tag are designed, and the probe set is synthesized on or attached to a substrate array (e.g., regDNA chip) using standard techniques.

Methods for the construction of polynucleotide arrays are known in the art. In certain methods, each polynucleotide on the array is synthesized in situ at a predetermined location on the array. See, for example, U.S. Pat. Nos. 5,143,854; 5,489,678; 5,744,305 and 6,600,031. In other methods, different pre-synthesized polynucleotides are attached to a substrate at individual, predetermined locations to form an array. See, for example, U.S. Pat. Nos. 5,807,522 and 6,110,426. Arrays can comprise DNA, RNA or other modified or synthetic polynucleotides. In addition, the arrays can comprise single-stranded polynucleotides, double-stranded polynucleotides, or any combination. Arrays comprising single-stranded polynucleotides can be used, e.g., for hybridization to other polynucleotides. Arrays comprising double-stranded polynucleotides can be used, e.g., to assess binding of proteins to sequences on the array. Methods for production of arrays comprising double-stranded polynucleotides are disclosed, for example, in U.S. Pat. Nos. 6,326,489 and 6,548,021 and in WO 02/18648.

Members of certain of the libraries prepared as described above typically contain DNA fragments that identify, via their nuclease-generated end, the precise location of a regulatory DNA element. The other end of the DNA fragment, typically located on the order of about 500 bp away, is generated, e.g., by a restriction enzyme during controlled shearing. As a consequence, each specific fragment contains approximately 100-300 bp of a stretch of regulatory DNA, as well as 100-400 bp of immediately adjacent sequence. Thus, the arrays described herein may include the entire fragments obtained from the library, the regulatory stretch alone or the adjacent sequence alone (or probes designed to recognize, e.g., by sequence complementarity, these fragments, regulatory stretches and/or polynucleotides adjacent to the regulatory sequences). For example, if a particular regulatory DNA region of a fragment is deemed unsuitable for interrogation in the context of the entire array, the adjacent DNA of the fragment can be used as the basis for probe set design. Preferably, the tag sequence of the fragment (to which a probe may be designed) is less than about 300 bp away from the end of the regulatory DNA sequence. A probe (or probe set) that is approximately 300 bp away from a putative site of transcription factor binding is quite acceptable for determining whether the factor is bound there, e.g., by chromatin immunoprecipitation (ChIP), because the DNA fragments obtained in a ChIP experiment are typically approximately 500 bp long.

The sequences (or probes) on each array can include regulatory sequences from any number of cell types and/or subjects (with or without various treatment protocols). For instance, an exemplary microarray, termed “the master epichip,” includes regulatory sequences that are broadly representative and inclusive of most or all of the complement of such DNA regulatory elements present in a genome, e.g., a human genome. Typically, a “master epichip” includes regulatory sequences (or probes thereto) identified as described above from a broad panel of available primary human tissues and/or cell lines including, but not limited to, whole blood nucleated cells, bone marrow, placenta, fibroblasts, stem cells (embryonic and adult), myocytes, cancer cell lines covering a wide range of tumor types (by tissue of origin, histology, propensity to metastasis, etc.), and cells challenged with a variety of environmental stimuli (heat shock, DNA damage, cell cycle arrest, growth stimulus, ECM culture substratum, etc.). Generally, a master epichip allows for the simultaneous interrogation of at least 60,000 regDNA elements. Such master epichips can be made from accessible sequences of any animal or plant (e.g., buffalo chip, potato chip). Additionally, master epichips comprising regulatory sequences of infectious agents, such as bacteria, viruses and single-celled eukaryotes, can be prepared.

Other exemplary arrays will include regulatory sequences derived primarily or totally from one or more particular tissues or cell types. This type of array, termed a “tissue epichip,” typically includes regulatory sequences (or probes thereto) identified from a particular tissue or cell type, for example, brain, liver, heart, lung, muscle, connective tissue, breast, prostate, immune tissue, etc., or tumors thereof. To give but a single example, a hematological epichip would contain regDNA prepared from whole-blood sorted nucleated cells and bone marrow, and, in some embodiments, a defined panel of cells derived from hematological malignancies, such as leukemias. Generally, a tissue epichip allows for the interrogation of more than 20,000 regDNA elements.

Yet another exemplary array is termed “a state-specific epichip” and comprises a microarray of regDNA corresponding to the panel of regDNA elements in a given cell or tissue type that are responsive to a particular environmental or developmental stimulus. The microarray is assembled by subjecting the tissue/cell type of interest to one or more stimuli, for example, administration of a hormone, environmental insult such as DNA damage or other stress, etc.; and subsequently preparing regDNA as described above from treated and untreated samples. In additional embodiments, regDNA is prepared from diseased and normal cells, infected and uninfected cells, cells from different tissues, or cells at different stages of development. Known subtractive procedures such as subtractive hybridization and representational difference analysis (RDA) may be used to identify regDNA elements that are uniquely represented in one or the other of the samples being compared. See, for example, Lisytsin et al. (1993) Science 259:946-951; Lisytsin et al. (1995) Methods in Enzymology 254:291-304 and U.S. Pat. Nos. 5,436,142; 5,501,964 and 5,958,738. Such unique sequences are then placed on an array.

It is evident that arrays of various dimensions can be used. In certain embodiments, the regulatory sequences are prepared in microarrays, the term given to sets of miniaturized chemical reaction areas that may also be used to test DNA fragments, antibodies, or proteins and the like. Microarrays, and preparation of these microarrays, are described extensively in the literature, for example in U.S. Pat. No. 6,576,424 and references cited therein. See also Horak et al. (2002) Proc. Natl. Acad. Sci. USA 99:2924-2929 and McGall et al. (2002) Adv. Biochem. Eng. Biotechnol 77:2142. An array of regulatory sequences, wherein the sequences present on the array are identified by virtue of their accessibility in cellular chromatin, can comprise any number of sequences, e.g., two or more. In certain embodiments, the one or more arrays as described herein contain a total of more than 50,000 regulatory DNA sequences (or probes thereto) identified as described above, for example between about 20,000 and 100,000 sequences or any value therebetween. In certain embodiments, approximately 65,000 regulatory DNA elements, identified and isolated based on accessibility in cellular chromatin, are ordered into one or more arrays. Further, the particular sequences making up the array can be from the same cell type, including but not limited to, normal cells from the same or different organs/structures of a subject, diseased cells from the same or different organs/structures of a subject, or cells treated with one or more drugs such as small molecules (with a molecular weight less than 10 kD), antibodies, or the like from the same or different organs/structures of a subject. Alternatively, a single array may contain regulatory sequences from multiple different cell types and/or subjects.

Methods for preparation of nucleic acids and/or proteins to be contacted with an array (e.g., amplification, labeling) and methods for detection of nucleic acid or protein bound at a particular site on an array are known in the art and involve, for example, PCR, fluorescent labeling and use of conjugated binding pairs such as avidin and biotin (e.g., detection of a biotinylated polynucleotide with an avidin-conjugated antibody or fluorophore). Secondary antibodies conjugated to detectable molecules or enzymes can be used for signal amplification.

VI. Applications

The regulatory DNA arrays (or “regDNA chip” or “epichip”) can be used for a variety of purposes. Non-limiting examples of such applications are set forth below.

A. Identification of Binding Sites for Human Regulatory Proteins

In yeast, chromatin immunoprecipitation-based methods have long been used to identify regulatory sequences that are bound by particular transcription factors and other DNA-binding proteins. As shown in the first four steps of the flowchart of FIG. 5, chromatin immunoprecipitation generally involves (1) subjecting living cells to conditions which result in protein-DNA crosslinking, thereby covalently linking DNA-binding proteins to the sequences to which they are bound in the cell; (2) shearing chromatin to a small size; (3) immunoprecipitating the sheared, crosslinked chromatin using an antibody against the protein of interest, under conditions such that the DNA chemically crosslinked to the protein will co-precipitate; and (4) reversing the crosslinks to obtain the bound DNA for further analysis. Typically, the DNA portions of the immunoprecipitated crosslinked cellular chromatin are then amplified, optionally labeled, and hybridized to a microarray containing the intergenic DNA from the yeast genome. This type of analysis of chromatin immunoprecipitated DNA on an array is also known as “ChIP on a chip,” because it analyzes DNA output from a chromatin immunoprecipitation (ChIP) on a regulatory DNA microarray, or chip. DNA that subsequently yields a high signal on the microarray represents sequences that were bound in vivo by the protein of interest in the native nuclear context.

As noted above, since all yeast regulatory sequences are intergenic, arrays representing yeast sequences can be readily obtained simply by constructing an array of intergenic sequences, and such arrays can be used to detect the targets of any given yeast transcription factor, for example one that has been subject to chromatin immunoprecipitation. See Wyrick et al., above, and FIG. 5. However, for the complex human genome, “ChIP on a chip” cannot be conducted, as in yeast, by hybridizing DNA obtained from a ChIP to an array of intergenic sequences, because the vast amount of intergenic DNA in the human genome precludes the construction of a single chip (or even a small number of chips) containing the entire complement of human intergenic DNA. Consequently, analysis of regulatory protein binding sites in the human genome is currently limited to individual small stretches of the genome (Horak et al. (2002) Proc Natl Acad Sci 99:2924-2929; Martone et al. (2003) Proc. Natl. Acad. Sci. USA 100: 12,247-12,252); small subsets of gene promoters (Ren et al. (2002) Genes Dev 16:245-256); or computationally identified CpG-rich stretches of uncertain regulatory relevance (Weinmann et al. (2002) Genes Dev 16:235-244).

Furthermore, certain experiments have revealed binding of regulatory factors to cellular chromatin that appears to be spurious and not related to any regulatory process, indicating that it is impossible to use a whole-genome microarray to determine whether or not in vivo binding of a regulator to a particular stretch is relevant to some regulatory process (Urnov (2003) J. Cell. Biochem. 84: 684).

The methods described herein allow the isolation, from among the large amount of intergenic DNA in the human genome, of only those sequences which serve a regulatory function, thereby making it possible, for the first time, to prepare a microarray of human regulatory sequences. In addition to intergenic regulatory sequences, regulatory sequences located within genes are also obtained. Accordingly, the arrays produced as described herein make possible “ChIP on a chip” to identify the direct in vivo targets, in the human genome, of any regulatory factor of interest. Moreover, and in contrast to previous methods, all binding detected in a ChIP assay, and further analyzed (by ChIP on a chip) using a regDNA array, is relevant to regulation.

The generation and use of regDNA chips to map human transcriptional regulatory networks provide a unique opportunity to develop effective therapeutics for virtually every gene-based disease. For instance, as detailed in Example 4 below, ChIP on a regDNA chip analysis of targets of estrogen receptor will allow for the development of more clinically effective selective estrogen receptor modulators (SERMs), for example for treating breast cancers. See, also, Ibrahim et al. (1999) Surg Oncol 8:103-123. Similarly, chronic pain, which can be caused by transcriptional upregulation of pain receptors in certain cells, affects approximately 50 million Americans. Cox et al. (2002) Expert Rev Neurotherapeutics 1:81-91. Using the methods described herein, active regulatory sequences unique to those cells can be isolated and placed on an array which can be used to identify transcriptional regulatory molecules in the cells, thereby helping to identify the currently unknown nature of the lesion in this transcriptional regulatory network.

B. Identification of Sequence Targets

The arrays and methods described herein can be used to identify the sequence targets and binding locations of natural or synthetic DNA binding proteins (e.g., transcription factors, replication factors, recombination factors, etc.) and other DNA-binding molecules (e.g., oligonucleotides, minor groove binders, antibiotics, chemotherapeutics). Furthermore, proteins tested by this method and shown to bind regulatory sequences associated with genes misregulated in disease are potential targets for therapeutic intervention. By using proteins derived from normal and/or diseased tissues, one can derive a functional link between a particular protein and its role in regulation of genes in the normal or disease state in the cell.

A protein preparation is derived from any number of potential sources. The protein preparation may be derived from normal or diseased cells or tissues. The protein preparation may be derived by expression of the gene encoding the protein in a heterologous gene expression system (E. coli, yeast, insect cells, or mammalian cell culture, for example) and optionally at least partly purified from this source. The protein may be synthesized artificially using standard protein synthesis techniques.

To identify regulatory sequences to which a protein binds, the protein preparation is put into contact with the DNA on a regDNA chip and allowed to bind. The chip can contain double-stranded or single-stranded DNA, depending on the binding properties of the protein. The protein can be labeled with any detectable label prior to, or after, contact with the array and location(s) where the protein preparation has bound can be identified. For example, the protein can be labeled with a fluorescent tag, or a fluorescently-labeled antibody to the protein can be used for detection. Alternatively, a detectable label can be attached to the DNA bound to the array; in this case, a loss of signal at one or more particular sites on the array indicates the presence of bound protein. Such DNA labels can include intercalating dyes such as ethidium bromide and SYBR Green. In additional embodiments, the nucleic acid (or polypeptide) can be labelled with a fluorescent tag, and/or a nucleic acid (or polypeptide) binding molecule can be labelled with biotin, so that an enzyme conjugate such as streptavidin-horseradish peroxidase (HRP), which catalyses an optically detectable change in a substrate (different from the fluorescent tag), can be used.

In addition, the genomic locations of the regulatory sequences bound by the protein can be readily evaluated (e.g., by identifying the regulatory sequences on the chip that are bound by the protein and searching for homology to those sequences in the human genome sequence), thereby providing an indication of which genes the protein regulates and indicating further possible therapeutic targets. Using conventional transcriptional regulation assays, the protein can be further tested for its ability to regulate the gene(s), thereby confirming the identity of potential target genes and/or protein targets for therapeutic intervention.

C. RegDNA Profiling

An array (e.g., epichip) prepared as described above may also be used to determine the spectrum of active regDNA elements in a given cell or cell population. For example, a regulatory DNA library is obtained as described above, its sequences are amplified, amplified sequences are labeled with any suitable label, and the labeled, amplified sequences are hybridized to an array (e.g., a master epichip or a tissue epichip as described above). In this way, active regDNA sequences in any selected cell or tissue type can be determined. This knowledge can then be used to determine which transcription factors may be acting in those cell types, for example, by searching the sequence of the regDNA for transcription factor binding sites and/or by mapping the active regulatory sequences onto the genome, identifying genes adjacent to the mapped regulatory sequences, and comparing those genes to the cell's transcriptome determined by genome-wide expression profiling. Transcription factors that are uniquely active in a particular cell type provide insight into pathways for potential therapeutic intervention in various disease processes.
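The mapping step described above (placing active regulatory sequences on the genome, finding adjacent genes, and comparing them with the transcriptome) can be sketched as a nearest-neighbor lookup. The sketch assumes a single chromosome, a non-empty gene list sorted by transcription start site, and an expression profile given as a set of gene names; all names are hypothetical.

```python
import bisect


def nearest_gene(position, genes):
    """`genes` is a non-empty list of (tss_position, gene_name) tuples
    sorted by position. Returns the name of the gene whose transcription
    start site is closest to `position`."""
    positions = [tss for tss, _ in genes]
    i = bisect.bisect_left(positions, position)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = genes[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda gene: abs(gene[0] - position))[1]


def candidate_active_genes(regdna_positions, genes, expressed):
    """Genes adjacent to active regDNA elements that also appear in the
    cell's expression profile."""
    nearby = {nearest_gene(pos, genes) for pos in regdna_positions}
    return sorted(nearby & set(expressed))
```

Nearest-gene assignment is a simplification; regulatory elements can act on genes other than the closest one.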

D. Chromatin Epigenome Profiling

The arrays described herein can also be used to determine the state of histone modification (“the histone code”) at the regDNA elements in any given cell type(s). For example, chromatin immunoprecipitation is performed (as described above) using an antibody that recognizes a particular covalent chromatin modification (e.g., histone H3 methylated on lysine 9). The immunoprecipitated DNA sequences are then hybridized to a regDNA array. Sites on the array to which immunoprecipitated DNA hybridizes represent regulatory sequences located in or adjacent to nucleosomes bearing the particular chromatin modification of interest.

In addition, data from chromatin epigenomic profiling (e.g., genes that are the direct targets of histone modifiers such as the human enhancer of zeste) can be compared between cells that overexpress the histone modifier and cells that lack it. Typically, an increased signal from the modification of interest over a given DNA stretch is indicative of direct action by the modifier over this DNA stretch.

E. Chromatin-Based Toxicity Profiling

The arrays described herein also find use in evaluating the effects of a compound or treatment on a cell (e.g., toxicity, stress, etc.). For example, regDNA populations in treated cells can be isolated and characterized, and compared to those in untreated cells, if desired. Additionally, regDNAs prepared from treated cells can be hybridized to a regDNA array (epichip) as described herein to determine genes in the treated cell that are active (based on proximity to regDNA sequences isolated from the cell) in the treated cell, the histone code in the treated cell, etc. Subtractive hybridization and/or difference analysis (see above) can be used to determine regulatory sequences and genes that are preferentially activated in treated cells, compared to untreated cells.

In additional embodiments, the effect of a molecule (e.g., toxin, drug, small molecule with molecular weight less than about 10 kD) on the binding of one or more proteins to regulatory sequences can be assessed, either in vitro or in vivo. For example, a purified or partially purified protein can be assessed for its spectrum of binding to a double-stranded regDNA chip, in the presence and absence of a compound. For in vivo analyses, cells can be exposed to a compound, followed by “ChIP on a chip” analysis (see above) for a DNA-binding protein of interest, to determine whether the compound alters the binding properties of the protein.

F. SNP-Epichip

Single nucleotide polymorphisms (SNPs) are stable, bi-allelic sequence variants that are distributed throughout the genome, which are currently assayed using a variety of high-throughput automated methods. See, e.g., Mullikin et al. (2000) Nature 407:516-520. Haplotypes are collections of linked SNPs. Using the methods and compositions described herein, SNPs and haplotypes in regulatory sequences can now be identified in any given individual. In these embodiments, regDNA is typically prepared from cells (either pooled cells or a specific cell type) or from a selected individual and hybridized to an epichip as described herein under conditions that allow SNP interrogation. Such conditions can include high stringency and/or the use of functional groups and/or nucleotide analogues that facilitate single-nucleotide mismatch discrimination. See, for example, U.S. Pat. Nos. 5,801,155; 6,127,121; 6,312,894; 6,485,906; and 6,492,346.

G. MicroRNA Validation

Short non-coding RNAs (microRNAs or miRNAs) are known to regulate cellular processes including development, heterochromatin formation, and genomic stability in eukaryotes and have been studied using available array technology. Krichevsky et al. (2003) RNA 9(10):1274-81. However, using the regDNA arrays described herein now allows the functional relevance of microRNAs to be determined, for example, by preparing a microRNA population from a cell, reverse-transcribing the RNA into cDNA, labeling the cDNA, and hybridizing the micro-cDNA to a regDNA epichip as described herein. Alternatively, the microRNA can be labeled directly and used for hybridization. RegDNA elements that yield signal may correspond to microRNAs transcribed from accessible regions of chromatin.

H. Drug Discovery

Since diseased cells will typically have different genes active than non-diseased cells, analysis of regulatory DNA is particularly applicable to drug discovery. Indeed, the arrays and methods described herein can pinpoint the differently active genes in diseased cells and this knowledge can be used to identify therapeutic targets. Non-limiting examples of diseases that can be addressed using the compositions and methods described herein include cancers of various types, chronic pain, chronic pulmonary obstruction, diabetes, ischemic heart disease, neuropathy, coronary artery disease, peripheral arterial disease, asthma, rheumatoid arthritis, endocrine disorders, bacterial infections and viral infections.

The arrays and methods described herein greatly simplify the search for and design of drugs for any disease state. For example, using the arrays and methods described above, the regulatory DNA subset active in a given cell type can be determined, for example regDNA that are aberrantly active (i.e., accessible) in individuals with at least one disorder (e.g., cancer, chronic pain, etc.). Computational analysis of these aberrantly accessible elements (e.g., regDNAs located proximally to pain receptor genes) will help identify genes whose expression is misregulated, leading to identification of the relevant regulatory proteins. Such regulatory proteins, as well as the genes they regulate, are targets for therapeutic intervention. See, e.g., Sieweke et al. (2000) Methods Mol Biol 130:59-77.

I. Identification of Genes in a “Pre-Activation” State

Expression profiling methods utilize arrays of cDNAs or cDNA-specific oligonucleotides to provide information on genes that are expressed in a cell under a particular set of conditions. See, e.g., Wyrick et al. (2002) Curr. Opin. Genet. Devel. 12: 130-136. However, transcriptional activation is a multi-step process, and includes steps that precede the production of a mRNA, which is the endpoint of an expression profiling assay. Isolation of regulatory sequences, as described herein, can identify genes that have achieved a “pre-activation” state, in which their regulatory sequences have become accessible, but transcription initiation has not yet occurred. Such pre-active genes may become active subsequent to a secondary stimulus, or after passage of time. Comparison of a regulatory sequence profile with an expression profile, for a given cell or tissue, allows distinction between genes that are actively transcribed and genes that are capable of being transcribed, and distinguishes both types from inactive genes.
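The three-way distinction drawn above can be written as a small classification over the two profiles. The sketch assumes each profile has been reduced to a set of gene names; the function name and the labels are illustrative only.

```python
def classify_genes(all_genes, accessible, expressed):
    """Label each gene by comparing a regulatory sequence profile
    (`accessible`) with an expression profile (`expressed`)."""
    labels = {}
    for gene in all_genes:
        if gene in expressed:
            labels[gene] = "active"          # mRNA detected
        elif gene in accessible:
            labels[gene] = "pre-active"      # accessible, not yet transcribed
        else:
            labels[gene] = "inactive"
    return labels
```

In practice both profiles are quantitative, so the set memberships used here would come from thresholds that this sketch does not specify.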

J. Kits

The present disclosure also includes kits for obtaining information regarding regulatory DNAs, disease, drugs, transcription pathways, etc. In certain embodiments, the kits comprise one or more of the arrays, regulatory DNAs, probes, combinations thereof, etc., described herein. For example, one exemplary kit will include at least one array that allows identification of direct genomic targets of transcription factors while another kit includes at least one array for identifying the subset of regulatory DNA elements active in a given cell type. The kits described herein may also include one or more of the following: instructions, ancillary reagents or equipment, etc.

EXAMPLES

The following examples are illustrative of but do not limit the present disclosure:

Example 1 Preparation of Regulatory DNA Library from HEK 293 Cells

Human embryonic kidney cells (HEK 293) were cultured in DMEM (Dulbecco's modified Eagle medium) supplemented with 10% fetal bovine serum in a 5% CO₂ incubator at 37° C. Cells were grown to 60% confluence, at which point nuclei were isolated according to the method of Archer et al. (1999) Meth. Enzymol. 304:584-599. Briefly, the plate was rinsed with PBS, cells were detached from the plate and washed with PBS, then homogenized (Dounce A) in 10 mM Tris-Cl, pH 7.4, 15 mM NaCl, 60 mM KCl, 1 mM EDTA, 0.1 mM EGTA, 0.1% NP40, 5% sucrose, 0.15 mM spermine and 0.5 mM spermidine at 4° C. Nuclei were isolated from the homogenate by centrifugation at 1,400×g for 20 min at 4° C. through a cushion of 10 mM Tris-Cl, pH 7.4, 15 mM NaCl, 60 mM KCl, 10% sucrose, 0.15 mM spermine and 0.5 mM spermidine.

Pelleted nuclei were resuspended, to a concentration of 2×10⁷ nuclei per ml, in 10 mM HEPES, pH 7.5, 25 mM KCl, 5 mM MgCl₂, 5% glycerol, 0.15 mM spermine, 0.5 mM spermidine, 1 nM dithiothreitol, 0.5 mM phenylmethylsulfonylfluoride (PMSF) and warmed to 37° C. for 30 sec. Hpa II (New England Biolabs, Beverly, Mass.) was added to a final concentration of 10,000 Units/ml and the mixture was incubated at room temperature for 5 min. The reaction was stopped by addition of EDTA to 50 mM.

An equal volume of 1% low-melting point agarose in 1×PBS warmed to 37° C. was then added, and the mixture was aspirated into the barrel of a 1 ml tuberculin syringe and incubated at 4° C. for 10 min. The agarose plugs were then extruded from the syringe and incubated for 36 hrs. with gentle shaking at 50° C. in 5 ml of 0.5 M EDTA, 1% SDS, 50 μg/ml proteinase K. The plugs were washed 3 times with 5 ml of 1×TE (pH 8.0) buffer, then incubated for 1 hr. at 37° C. in 1×TE with 1 mM PMSF, followed by two more washes with 1×TE. The plugs were placed in 2 ml of Sau3AI reaction buffer for 30 min. on ice to allow equilibration. Sau3AI was then added to 2000 units/ml and the plugs incubated with gentle shaking for 16 hrs. at 37° C. The plugs were sliced with a razor blade and slices were placed in the well of a 0.8% agarose gel in 1×TAE. The gel was run at 50V for 8 hrs., stained with SYBR-Gold, and visualized on a Dark Reader transilluminator.

Fragments having an average size of between 50 and 1000 nucleotide pairs were purified from the gel using a Qiagen gel extraction kit. The fragments purified from the gel are a mixture of Sau 3AI fragments (i.e., fragments having two Sau 3AI ends) and fragments having one Sau 3AI end and one Hpa II-generated end. The latter category of fragments is enriched for sequences accessible in chromatin. These fragments were preferentially cloned as follows.

The resulting population of DNA fragments was inserted into pBluescript II KS that had been digested with Bam HI and Cla I, under standard conditions. Under these conditions, Hpa II ends were inserted into the Cla I site and the Sau 3AI ends were inserted into the Bam HI site. Approximately 40,000-50,000 clones were obtained.
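The directional cloning step relies on matching cohesive ends. The following is a minimal sketch, not part of the original protocol, that checks which insert ends can ligate into which vector site. The recognition sites and cut positions are the standard ones for these four enzymes; the table layout and the function names `overhang` and `compatible` are illustrative assumptions.

```python
# Recognition site and top-strand cut offset for each enzyme.
# All four sites are palindromic and leave 5' overhangs.
ENZYMES = {
    "HpaII": ("CCGG", 1),    # C^CGG
    "ClaI": ("ATCGAT", 2),   # AT^CGAT
    "Sau3AI": ("GATC", 0),   # ^GATC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}


def overhang(enzyme):
    """Return the single-stranded 5' overhang left by the enzyme."""
    site, cut = ENZYMES[enzyme]
    # For a palindromic site cut symmetrically, the overhang is the
    # stretch between the top-strand cut and the bottom-strand cut.
    return site[cut:len(site) - cut]


def compatible(insert_enzyme, vector_enzyme):
    """Two ends can anneal when their overhangs are identical."""
    return overhang(insert_enzyme) == overhang(vector_enzyme)


if __name__ == "__main__":
    for ins, vec in [("HpaII", "ClaI"), ("Sau3AI", "BamHI"), ("HpaII", "BamHI")]:
        print(ins, vec, compatible(ins, vec))
```

Under this sketch, a fragment with two Sau 3AI ends cannot fill both sites of the Bam HI/Cla I-cut vector, which is consistent with the preferential recovery of fragments carrying one Hpa II end.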

Example 2 Analysis of Selected Clones

Approximately 1% (405) of the clones of the HEK library prepared as described above were used to determine four parameters: percentage of sequences corresponding to DNase I hypersensitive sites; genomic locations of the cloned sequences; regulatory properties; and proportion of unique clones.

A. Clones Corresponding to DNase I Hypersensitive Sites

The fraction of clones in the library that correspond to DNase I hypersensitive sites (as opposed to, e.g., randomly sheared fragments) was tested using a pool of 10 clones randomly selected from the 405 chosen for analysis. The clones in the library were isolated based on their accessibility to nucleases within cellular chromatin. Because of the massively parallel nature of such isolation, it was important to prove by an independent method that the clones isolated truly correspond to accessible regions of cellular chromatin, e.g., DNase I hypersensitive sites. See, for example, Gross and Garrard (1988) Annu Rev Biochem 57, 159-197. To obtain confirmation that the cloned sequences were obtained from accessible regions of cellular chromatin, the sequences of the ten clones were mapped on the genome, and the chromatin structure of the regions to which they mapped was determined (FIG. 2).

To map the cloned sequences on the genome, the human genome sequence was searched, using each of the sequences as input. For each clone, a unique location on the genome was obtained. For each of these locations, a diagnostic restriction enzyme was selected, which yielded a restriction fragment spanning the area of the genome to which the clone mapped. DNase I hypersensitive site analysis (Wu (1980) Nature 286: 854-860) was then conducted in that area of the genome. Accordingly, nuclei were isolated from HEK 293 cells and treated with DNase I, DNA purified from the DNase I-treated nuclei was subjected to digestion with the diagnostic restriction enzyme, and the locations of DNase I hypersensitive sites were identified by indirect end-labeling (Wu, supra). For 9 out of the 10 clones, the DNA stretch in the genome identified by the clone resided in a DNase I hypersensitive site in vivo. Four examples are provided in FIG. 2. Note that the lanes denoted “M” in FIG. 2 represent DNA digested with the diagnostic restriction enzyme and a marker restriction enzyme, whose recognition sequence was within the diagnostic restriction fragment, close to the area to which the clone mapped, thereby providing a reference point on the gel. These results confirm that the methods described herein produce nuclease cleavage in non-hypersensitive areas only about 10% of the time, irrespective of the genomic location of the clone (see below).

B. Genomic Locations of Clones with Respect to Transcription Units

The genomic distribution of sequences represented in the clones was evaluated, with respect to the locations of known transcription units, to determine what fraction of the clones identified novel regulatory DNA elements and what fraction fell into already identified regions, such as core promoters.

Certain of the cloned sequences were located in gene promoters (example shown in FIG. 2A). However, this analysis also revealed that clones mapped to sites well upstream of a transcription start site (e.g., FIG. 2B), 20 kb downstream of a transcription start site (e.g., FIG. 2C) and as far as 150 kb away from the nearest known gene (e.g., FIG. 2D). Subsequently, a broader analysis of genomic location of the 405 clones randomly isolated from the regulatory DNA library was undertaken. A key prediction of a regulatory DNA isolation project is that a considerable proportion of the clones should derive from known regulatory DNA elements. A BLAST algorithm was used to evaluate the location of the 405 clones relative to the transcription start site of 35,000 annotated genes in the human genome. As shown in FIG. 3, none of the clones derived from repetitive DNA elements, which encompass about 50% of the human genome.

When the locations of the clones were compared to known transcription start sites, 58% of the clones in the library mapped to within 10 kb of a known transcription start site (compared to only 12% of the human genome which lies within 10 kb of a known transcription start site). Approximately 16% of the randomly chosen clones (66 out of 405) fell within the core promoter of known genes. The remaining 84% fell outside core promoter regions, a finding consistent with observations made on those few well-studied loci in the human genome, including the β-globin and SCL regions, in which regulatory DNA has been comprehensively experimentally mapped, and where a considerable majority of such elements was found to lie outside of the core promoter region. Bulger et al. (2002) Curr Opin Genet Dev 12:170-177; Gottgens (2000) Nat Biotechnol 18:181-186. Thus, the procedures described herein provide remarkable selectivity for regulatory DNA and, in addition, identify regulatory sequences that cannot be identified computationally (e.g., the 84% of clones that do not map to core promoter regions) but which are located in DNase I hypersensitive sites (as shown in FIG. 2) and therefore represent bona fide regulatory DNA.
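The proportions quoted above can be re-derived with simple arithmetic. The following is a minimal sketch that uses only the counts and percentages stated in this example; the function names are illustrative assumptions and do not correspond to any software used in the original analysis.

```python
def fraction(count, total):
    """Fraction of clones falling in a category."""
    return count / total


def fold_enrichment(observed_fraction, background_fraction):
    """Ratio of the observed fraction to the genome-wide background fraction."""
    return observed_fraction / background_fraction


# 66 of the 405 analyzed clones fell within core promoters.
core_promoter_fraction = fraction(66, 405)

# 58% of clones lie within 10 kb of a known transcription start site,
# versus 12% of the genome as a whole.
tss_enrichment = fold_enrichment(0.58, 0.12)

if __name__ == "__main__":
    print(f"core promoter fraction: {core_promoter_fraction:.1%}")
    print(f"outside core promoters: {1 - core_promoter_fraction:.1%}")
    print(f"enrichment within 10 kb of a start site: {tss_enrichment:.1f}-fold")
```

On these figures the library is enriched a little under five-fold for sequences within 10 kb of a known transcription start site.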

C. Regulatory Properties

The relevance, to genome regulation, of the isolated accessible sequences was evaluated to ascribe actual regulatory properties to the fragments, using criteria such as density of transcription factor binding sites, conservation in genomes of other mammals, location relative to genes known to be active in human kidney cells, etc. In particular, to independently confirm that the non-promoter DNA sequences were regulatory DNA, three well-established criteria for regulatory DNA were evaluated, essentially as described in Pennacchio et al. (2001) Nat Rev Genet 2:100-109, including: (1) sequence conservation between the mouse and human genomes; (2) enrichment of transcription factor binding sites; (3) location close to active genes.

As shown in FIG. 4, approximately 75% of the non-promoter, non-coding clones are located in short sequence stretches that are conserved between the mouse and human genomes, representing an enormous enrichment over what would have been expected based on the overall degree of non-coding conservation of DNA sequence between the mouse and human genomes.

The isolated accessible DNA sequences are enriched, relative to bulk DNA, in known transcription factor binding sites. Pennacchio et al., supra. Multiple chosen non-promoter sequences were analyzed using the publicly available TransFAC database. Wingender et al. (2001) Nucl Acid Res 29:281-283. On average, non-promoter clones had an approximately 3-fold greater number of transcription factor binding sites per 100 bp than a randomly chosen DNA sequence of identical GC-content.
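A comparison of this kind can be sketched as follows. This is an illustration only: the original analysis used the TransFAC database, whereas the sketch below counts exact matches to a caller-supplied list of motif strings, and it builds the GC-matched control by shuffling the clone sequence, which is one assumed way of obtaining a random sequence of identical GC-content. All function names are hypothetical.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sites_per_100bp(seq, motifs):
    """Count exact, possibly overlapping motif matches per 100 bp."""
    seq = seq.upper()
    hits = 0
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits += 1
            start = seq.find(motif, start + 1)
    return 100.0 * hits / len(seq)


def gc_matched_random(seq, rng):
    """Shuffle the bases; composition, and hence GC-content, is preserved."""
    bases = list(seq.upper())
    rng.shuffle(bases)
    return "".join(bases)


def density_ratio(seq, motifs, rng, n_controls=100):
    """Site density in seq relative to the mean density in shuffled controls."""
    observed = sites_per_100bp(seq, motifs)
    controls = [sites_per_100bp(gc_matched_random(seq, rng), motifs)
                for _ in range(n_controls)]
    mean_control = sum(controls) / n_controls
    # Return None when no control contains a site, to avoid dividing by zero.
    return observed / mean_control if mean_control > 0 else None
```

A caller would pass a seeded `random.Random` instance so that the control set is reproducible.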

Chromatin remodeling (e.g., accessibility) at regulatory DNA is known to correlate with level of gene activity. Accordingly, the 235 clones derived from within 10 kb of the start site of known genes were analyzed with respect to the activity of their gene neighbor in HEK 293 cells, using an Affymetrix GeneChip® designed for this purpose. Approximately 75% of the regulatory DNA clones were adjacent to (i.e., within 10 kb of) genes that are scored as being active in HEK 293 cells by GeneChip® analysis.

D. Proportion of Unique Sequences Cloned

To determine whether the library represents a comprehensive sampling of accessible sequences in cellular chromatin, the 405 clone sequences were compared to each other in terms of their genomic location. Each clone identified a distinct location in the genome, indicating that, at least in the 405 clones chosen, there was no skew towards a particular genomic location that is preferentially accessible within cellular chromatin. Furthermore, to determine whether the library is skewed in terms of containing known regulatory DNA sequences, the genomic locations of the clones were compared to transcription start sites of known genes. According to this analysis, ˜20% of the clones identified locations within 1 kb of the transcription start sites of known genes.
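The two checks described above (distinct genomic locations, and proximity to known transcription start sites) can be sketched in a few lines. This is a hedged illustration: the representation of a clone as a `(chromosome, position)` pair, the per-chromosome sorted lists of start sites, and the function names are assumptions made for the example, not details taken from the original analysis.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, sorted_tss):
    """Distance in bp to the closest entry of a non-empty sorted list."""
    i = bisect_left(sorted_tss, position)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in candidates)


def summarize(clones, tss_by_chrom, window=1000):
    """Return (all locations distinct?, fraction of clones within `window` bp of a TSS)."""
    all_distinct = len(set(clones)) == len(clones)
    near = sum(
        1
        for chrom, pos in clones
        if chrom in tss_by_chrom
        and distance_to_nearest_tss(pos, tss_by_chrom[chrom]) <= window
    )
    return all_distinct, near / len(clones)
```

With `window=1000` the second value corresponds to the 1 kb criterion used in this section; `window=10000` would give the 10 kb criterion used in section B.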

These results demonstrate that at least 80% of the DNA fragments in the library correspond to genome regulatory elements that cannot be comprehensively identified using any other computational or experimental technique available. The relatively large proportion of non-promoter regulatory DNA elements active in HEK 293 cells is in accord with the literature. Pennacchio et al. (2001) Nat Rev Genet 2:100-109.

In sum, the massively parallel isolation of regulatory DNA from human cells described herein results in pools of fragments (a) of which at least 90% derive from DNase I hypersensitive sites; (b) of which 16% derive from core gene promoters; (c) that are enriched for elements within 10 kb of gene transcription start sites; (d) that are enriched for DNA elements conserved between the mouse and human genomes; and (e) that are enriched for sequences with a considerably higher than expected density of transcription factor binding sites.

Example 3 Identification of Target Sequences of Estrogen Receptor (ER)

The human genome contains approximately 2,000 transcription factors that regulate every aspect of human development, adult ontogeny, and disease. Aberrant function of transcription factors causes disease: for example, breast cancer results from the aberrant function of the estrogen receptor (ER). Henderson et al. (2000) Carcinogenesis 21:427-433. Although estrogen and the estrogen receptor are well established as causative agents of breast cancers, little is known about the regulatory network of breast epithelium response to ER. See, e.g., Sommer et al. (2001) Semin Cancer Biol 11:339-352; Sewack et al. (2001) Mol Cell Biol 21:1404-1415; Shang et al. (2000) Cell 103:843-852; and Ghosh et al. (2000) Cancer Res 60:6367-6375.

The primary obstacle to developing more effective therapeutic agents for breast cancer is thus the lack of information about the direct genomic targets of ER in the human genome. It is known that estrogen affects transcription of approximately 2,000 genes, but as few as 10 have been tentatively identified as direct targets. As a result of this information void, existing therapeutics that affect function of ER, e.g., tamoxifen, are only partly effective. If the direct targets of ER were known, then modulators of its function could be evaluated directly based on their effects on target genes most critical to disease onset and progression, but these direct targets remain largely unknown.

The following experiments are performed to identify direct target sequences of the ER transcription factor. Chromatin immunoprecipitation (ChIP) is conducted on human breast carcinoma line MCF-7 (ATCC Accession No. HTB-22) using an anti-ER antibody. See, for example, Kuo et al. (1999) Methods 19:425-433; O'Neill et al. (1999) Meth. Enzymology 274:189-197 and Orlando (2000) Trends Biochem. Sci. 25:99-104. Antibodies directed against the estrogen receptor are commercially available. Positive controls are obtained by analysis of known ER target genes including pS2 (Sewack et al. (2001) Mol Cell Biol 21:1404-1415); cathepsin W (Shang et al. (2000) Cell 103:843-852); PDZK1, and GREB1 (Ghosh et al. (2000) Cancer Res 60:6367-6375). Negative controls are obtained from MCF-7 cells cultured in the presence of estrogen and insulin because, under these culture conditions, ER does not bind to its target sites and relocates to the cytoplasm. Sommer et al. (2001) Semin Cancer Biol 11:339-352. Using these controls, only ChIP results that show at least 5-fold enrichment for core promoters of the positive control genes relative to the negative controls are selected for analysis on a regDNA chip.
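The 5-fold selection criterion can be expressed as a simple quality gate. The sketch below is illustrative only: it assumes that each positive control promoter has one quantified signal in the test ChIP and one in the negative control, and it assumes that every positive control promoter must individually reach the threshold, which the text does not state explicitly. The function names and example signal values are hypothetical.

```python
def fold_enrichment(chip_signal, negative_signal):
    """Ratio of the ChIP signal to the matched negative-control signal."""
    if negative_signal <= 0:
        raise ValueError("negative-control signal must be positive")
    return chip_signal / negative_signal


def passes_chip_selection(promoter_signals, min_fold=5.0):
    """promoter_signals maps gene name -> (ChIP signal, negative-control signal).

    Returns True only if every positive control promoter is enriched
    at least `min_fold` over the negative control (an assumed reading).
    """
    return all(
        fold_enrichment(chip, negative) >= min_fold
        for chip, negative in promoter_signals.values()
    )


if __name__ == "__main__":
    # Hypothetical signal values, for illustration only.
    example = {"pS2": (50.0, 8.0), "GREB1": (30.0, 5.0)}
    print(passes_chip_selection(example))
```

Only ChIP preparations for which this check succeeds would be carried forward to hybridization on the regDNA chip.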

To determine direct genomic targets of ER, the ChIP outputs from treated cells meeting these selection criteria are hybridized to a regDNA chip and the resulting pattern compared to the pattern of hybridization from ChIP performed on cells that were not treated with estrogen. Analysis is conducted essentially as described in Horak et al. (2002) Proc Nat'l Acad Sci USA 99:2924-2929; Ren et al. (2002) Genes Dev 16:245-256; and Weinmann et al. (2002) Genes Dev 16:235-244. The data are evaluated using three independent metrics: (1) increase of at least 2.5 fold of a signal for known ER targets over control targets (e.g., genes such as GAPDH, β-actin); (2) positional analysis of identified DNA regulatory stretches bound by ER relative to genomic position of genes for which transcription is known to be affected by ER; and (3) target validation by manual analysis (e.g., using PCR with primers that amplify regulatory DNA identified by the regDNA chip to confirm binding of ER; see e.g., Martone et al. (2003) supra).
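Metric (1) can be illustrated with a short calculation. This is a sketch under stated assumptions: the baseline is taken as the mean signal of the control features, which is one plausible reading of "over control targets", and the feature names and the function name are hypothetical.

```python
from statistics import mean


def targets_over_controls(signals, control_names, min_fold=2.5):
    """Return features whose signal is at least `min_fold` times the control baseline.

    signals: dict mapping feature name -> hybridization signal.
    control_names: features used as controls (e.g., GAPDH, beta-actin).
    """
    # Assumed baseline: the mean signal across all control features.
    baseline = mean(signals[name] for name in control_names)
    result = {}
    for name, signal in signals.items():
        if name in control_names:
            continue
        fold = signal / baseline
        if fold >= min_fold:
            result[name] = fold
    return result
```

Features returned by this function would then go on to the positional analysis and manual validation described in metrics (2) and (3).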

Example 4 Analysis of Drug Effects

The following experiments are also conducted to determine the effect of estrogen and/or tamoxifen on gene activity in breast cancer cells.

A. Estrogen

Previously, more than 550 genes have been identified to be activated by at least 3-fold, and approximately 450 have been shown to be repressed by at least about 2-fold, upon estrogen treatment of MCF-7 cells. Accordingly, to examine the effects of estrogen on regulatory sequences, MCF-7 cells are starved of estrogen and insulin for 7 days, and then half of the cells are treated with both hormones for 48 hrs. Regulatory DNA is prepared from both cell populations as described above and compared to the corresponding mRNA expression profile.

Duplicate batches of regulatory DNAs from estrogen treated and untreated cells are hybridized to regDNA chips. Expected results include at least a 2-fold decrease in regulatory DNA hybridization to the regDNA chip of 50% of those genes that are known to be repressed upon estrogen treatment. In addition, a positive correlation between gene activity and representation of its regulatory DNA in the regulatory DNA profile and low S.E.M. (<20% total signal) between biological duplicates is expected.
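The two expectations above (fold decrease for repressed genes, and low S.E.M. between duplicates) can be checked as sketched below. The sketch interprets "<20% total signal" as an S.E.M. smaller than 20% of the mean signal for a feature, which is an assumption; the data layout (gene name mapped to a list of replicate signals) and the function names are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def sem_fraction(replicates):
    """Standard error of the mean, expressed as a fraction of the mean signal."""
    sem = stdev(replicates) / sqrt(len(replicates))
    return sem / mean(replicates)


def fraction_decreased(untreated, treated, repressed_genes, min_fold=2.0):
    """Fraction of known repressed genes whose regulatory DNA signal drops
    at least `min_fold` upon treatment (mean over replicates)."""
    hits = sum(
        1
        for gene in repressed_genes
        if mean(untreated[gene]) / mean(treated[gene]) >= min_fold
    )
    return hits / len(repressed_genes)


def duplicates_agree(replicates, max_sem_fraction=0.20):
    """True when the S.E.M. is below the assumed 20%-of-signal limit."""
    return sem_fraction(replicates) < max_sem_fraction
```

Under this reading, the expected outcome is `fraction_decreased(...) >= 0.5` together with `duplicates_agree(...)` returning True for the features of interest.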

B. Estrogen and Tamoxifen

The nature of tissue-specific differences of tamoxifen action (which is anti-estrogenic in the breast and pro-estrogenic in the endometrium) is determined by comparing 4 datasets: (i) regDNA-wide distribution of ER in breast tissue following estrogen treatment; (ii) regDNA-wide distribution of ER in breast tissue following tamoxifen treatment; (iii) regDNA-wide distribution of ER in the endometrium following estrogen treatment; (iv) regDNA-wide distribution of ER in the endometrium following tamoxifen treatment.

Differences in the regDNA stretches occupied by ER in the breast are expected, depending on whether the tissue is treated with tamoxifen or estradiol. A large number of genes, however, will be bound by ER in breast tissue in the presence of either tamoxifen or estradiol; these will represent those ER targets most directly relevant to ER action in the breast. At the same time, it is expected that a large number of genes in the endometrium will be bound by ER in the presence of both ligands. The critical step, therefore, will be to identify those genes that are bound by ER in the breast, but not in the endometrium, and vice versa. Furthermore, it will be critical to determine how ER distribution on those genes (assayed, e.g., by ChIP on a regDNA chip) is affected by estrogen vs. tamoxifen treatment. Tissue-to-tissue and ligand-to-ligand differences between these samples will illuminate genes directly relevant to the tissue-specific action of these ER ligands.
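The comparison of the four datasets amounts to a handful of set operations. The following is a minimal sketch, assuming that each dataset has already been reduced to a set of gene (or regDNA feature) identifiers scored as bound by ER; the function name, the dictionary keys and the example identifiers are hypothetical.

```python
def compare_er_occupancy(breast_estrogen, breast_tamoxifen,
                         endometrium_estrogen, endometrium_tamoxifen):
    """Partition ER-bound genes by tissue and ligand using set algebra."""
    breast_any = breast_estrogen | breast_tamoxifen
    endometrium_any = endometrium_estrogen | endometrium_tamoxifen
    return {
        # Bound in a tissue regardless of which ligand is present.
        "breast_both_ligands": breast_estrogen & breast_tamoxifen,
        "endometrium_both_ligands": endometrium_estrogen & endometrium_tamoxifen,
        # Bound in one tissue but never in the other.
        "breast_only": breast_any - endometrium_any,
        "endometrium_only": endometrium_any - breast_any,
        # Ligand-dependent occupancy within the breast.
        "breast_estrogen_only": breast_estrogen - breast_tamoxifen,
        "breast_tamoxifen_only": breast_tamoxifen - breast_estrogen,
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    result = compare_er_occupancy({"g1", "g2", "g3"}, {"g2", "g3", "g4"},
                                  {"g3", "g5"}, {"g3", "g6"})
    for label, genes in result.items():
        print(label, sorted(genes))
```

The "breast_only" and "endometrium_only" sets correspond to the tissue-specific targets singled out above as the critical step.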

All references cited herein are hereby incorporated by reference in their entireties for all purposes.

LIST OF REFERENCES

Birrell, G. W. et al. Transcriptional response of Saccharomyces cerevisiae to DNA-damaging agents does not identify the genes that protect against these agents. Proc Natl Acad Sci USA 99, 8778-83 (2002).

Bulger, M., Sawado, T., Schubeler, D. & Groudine, M. ChIPs of the beta-globin locus: unraveling gene regulation within an active domain. Curr Opin Genet Dev 12, 170-7 (2002).

Cox, J. M. & Papagallo, M. Contemporary and emergent pharmacological therapies for chronic pain: nonopioid analgesia. Expert Rev. Neurotherapeutics 1, 81-91 (2002).

Elgin, S. C. The formation and function of DNase I hypersensitive sites in the process of gene activation. J Biol Chem 263, 19259-62 (1988).

Galas, D. J. Sequence interpretation. Making sense of the sequence. Science 291, 1257-60 (2001).

Ghosh, M. G., Thompson, D. A. & Weigel, R. J. PDZK1 and GREB1 are estrogen-regulated genes expressed in hormone-responsive breast cancer. Cancer Res 60, 6367-75 (2000).

Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387-91 (2002).

Gottgens, B. et al. Analysis of vertebrate SCL loci identifies conserved enhancers. Nat Biotechnol 18, 181-6 (2000).

Gross, D. S. & Garrard, W. T. Nuclease hypersensitive sites in chromatin. Annu Rev Biochem 57, 159-97 (1988).

Hebbes, T. R., Clayton, A. L., Thorne, A. W. & Crane-Robinson, C. Core histone hyperacetylation co-maps with generalized DNase I sensitivity in the chicken beta-globin chromosomal domain. EMBO J 13, 1823-30 (1994).

Henderson, B. E. & Feigelson, H. S. Hormonal carcinogenesis. Carcinogenesis 21, 427-33 (2000).

Horak, C. E. et al. GATA-1 binding sites mapped in the beta-globin locus by using mammalian chip-chip analysis. Proc Natl Acad Sci USA 99, 2924-2929 (2002).

Ibrahim, N. K. & Hortobagyi, G. N. The evolving role of specific estrogen receptor modulators (SERMs). Surg Oncol 8, 103-23 (1999).

Johnson, K. D. & Bresnick, E. H. Dissecting long-range transcriptional mechanisms by chromatin immunoprecipitation. Methods 26, 27-36 (2002).

Kozlova, T. & Thummel, C. S. Steroid Regulation of Postembryonic Development and Reproduction in Drosophila. Trends Endocrinol Metab 11, 276-280 (2000).

Nal, B., Mohr, E. & Ferrier, P. Location analysis of DNA-bound proteins at the whole-genome level: untangling transcriptional regulatory networks. Bioessays 23, 473-6 (2001).

Pennacchio, L. A. & Rubin, E. M. Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet 2, 100-9 (2001).

Pilpel, Y., Sudarsanam, P. & Church, G. M. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29, 153-9 (2001).

Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).

Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306-9 (2000).

Sewack, G. F., Ellis, T. W. & Hansen, U. Binding of TATA Binding Protein to a Naturally Positioned Nucleosome Is Facilitated by Histone Acetylation. Mol Cell Biol 21, 1404-1415 (2001).

Shang, Y., Hu, X., DiRenzo, J., Lazar, M. A. & Brown, M. Cofactor dynamics and sufficiency in estrogen receptor-regulated transcription. Cell 103, 843-52 (2000).

Sieweke, M. Detection of transcription factor partners with a yeast one hybrid screen. Methods Mol Biol 130, 59-77 (2000).

Sommer, S. & Fuqua, S. A. Estrogen receptor and breast cancer. Semin Cancer Biol 11, 339-52 (2001).

Urnov, F. D. A feel for the template: zinc finger protein transcription factors and chromatin. Biochem Cell Biol 80, 321-333 (2002).

Urnov, F. D., Rebar, E. J., Reik, A. & Pandolfi, P. P. Designed transcription factors as structural, functional and therapeutic probes of chromatin in vivo: Fourth in review series on chromatin dynamics. EMBO Rep 3, 610-5 (2002).

Verreault, A. De novo nucleosome assembly: new pieces in an old puzzle. Genes Dev 14, 1430-8 (2000).

Weinmann, A. S. & Farnham, P. J. Identification of unknown target genes of human transcription factors using chromatin immunoprecipitation. Methods 26, 37-47 (2002).

Weinmann, A. S., Yan, P. S., Oberley, M. J., Huang, T. H. & Farnham, P. J. Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev 16, 235-44 (2002).

Wingender, E. et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res 29, 281-3 (2001).

Wyrick, J. J. & Young, R. A. Deciphering gene expression regulatory networks. Curr Opin Genet Dev 12, 130-136 (2002).

1. A method for making an array, the method comprising: (a) isolating a plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin; and (b) attaching each of the isolated sequences to an address on a solid support.
2. An array comprising a plurality of accessible polynucleotide sequences, wherein: (a) the sequences are isolated based on their accessibility in cellular chromatin; and (b) each accessible sequence is located at a distinct address on a solid support.
3. The array of claim 2, wherein the accessible sequences are isolated from a plurality of different cell types from an organism.
4. The array of claim 2, wherein the accessible sequences are isolated from a single cell or tissue type from an organism.
5. The array of claim 2, wherein the accessible sequences are isolated according to the following procedure: (a) isolating a first plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin from a first cell; (b) isolating a second plurality of cellular polynucleotide sequences, whereby the sequences are isolated based on their accessibility in cellular chromatin from a second cell; (c) obtaining sequences that are unique to either the first or second plurality of cellular polynucleotide sequences; and (d) attaching each of the isolated sequences obtained in step (c) to an address on a solid support.
6. A method of identifying a target sequence bound by a DNA-binding protein, the method comprising the steps of: (a) contacting at least one DNA-binding protein with an array according to claim 2, under conditions such that the protein binds to accessible sequences comprising a target sequence bound by the protein; (b) removing unbound proteins; and (c) identifying the accessible sequences bound by the protein, thereby identifying target sequences for the protein.
7. A method of identifying a transcription factor, the method comprising the steps of: (a) preparing a preparation of proteins from a cell; (b) contacting the isolated proteins with an array according to claim 2, under conditions such that transcription factors in the protein preparation bind to accessible sequences comprising a target sequence bound by a transcription factor; (c) removing unbound proteins; and (d) identifying the proteins bound to the array.
8. A method for obtaining a regulatory profile of accessible sequences in a cell, the method comprising: (a) isolating a plurality of polynucleotide sequences from the cell, whereby the sequences are isolated based on their accessibility in cellular chromatin; (b) optionally amplifying the sequences obtained in step (a); (c) optionally labeling the sequences of step (a) or (b); (d) contacting the sequences of step (a), (b) or (c) with an array according to claim 3; and (e) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell.
9. A method for identifying functional binding sites for a DNA-binding protein in a cell, the method comprising: (a) subjecting a cell to conditions under which DNA-binding proteins are crosslinked to their binding sites in cellular chromatin; (b) shearing the crosslinked cellular chromatin of step (a); (c) immunoprecipitating the sheared crosslinked chromatin of step (b) with an antibody which recognizes the DNA-binding protein; (d) reversing the crosslinks in the immunoprecipitate of step (c); (e) purifying the DNA from the immunoprecipitated material of step (d); (f) optionally amplifying the DNA obtained in step (e); (g) optionally labeling the DNA of step (e) or (f); (h) contacting the DNA from step (e), (f) or (g) with an array according to claim 2; and (i) identifying the accessible sequences bound on the array, thereby identifying functional binding sites for the DNA-binding protein in the cell.
10. A method of identifying a sequence in cellular chromatin, wherein the chromatin is covalently modified, the method comprising: (a) providing a sample of cellular chromatin; (b) optionally subjecting the chromatin of step (a) to conditions under which DNA-binding proteins are crosslinked to their binding sites in cellular chromatin; (c) shearing the cellular chromatin of step (a) or (b); (d) immunoprecipitating the sheared chromatin of step (c) with an antibody which recognizes a covalent chromatin modification; (e) purifying the DNA from the immunoprecipitated material of step (d); (f) optionally amplifying the DNA obtained in step (e); (g) optionally labeling the DNA of step (e) or (f); (h) contacting the DNA from step (e), (f) or (g) with an array according to claim 2; and (i) identifying the accessible sequences bound on the array, thereby identifying sequences in cellular chromatin wherein the chromatin is covalently modified.
11. A method for characterizing the effects of a molecule on a cell, the method comprising: (a) contacting the cell with the molecule; (b) isolating a first plurality of polynucleotide sequences from the cell of step (a), whereby the sequences are isolated based on their accessibility in cellular chromatin; (c) optionally amplifying the sequences obtained in step (b); (d) optionally labeling the sequences of step (b) or (c); (e) contacting the sequences of step (b), (c) or (d) with an array according to claim 2; and (f) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell.
12. The method of claim 11, further comprising the steps of: (g) providing cells that have not been contacted with the molecule; (h) isolating a second plurality of polynucleotide sequences from the cell of step (g), whereby the sequences are isolated based on their accessibility in cellular chromatin; (i) optionally amplifying the sequences obtained in step (h); (j) obtaining sequences that are unique to either the first or second plurality of polynucleotide sequences; (k) optionally amplifying the sequences obtained in step (j); (l) optionally labeling the sequences of step (j) or (k); (m) contacting the sequences of step (j), (k) or (l) with an array according to claim 2; and (n) identifying the accessible sequences bound on the array, thereby identifying differences in accessible sequences between cells that have and have not been contacted with the molecule.
13. A method of identifying single nucleotide polymorphisms (SNPs) in regulatory sequences of an individual, the method comprising the steps of: (a) preparing a library of regulatory DNA sequences from chromatin isolated from cells from the individual; (b) optionally labeling the sequences of step (a); (c) hybridizing the sequences of step (a) or (b) to an array according to claim 2 under stringent hybridization conditions, wherein the regulatory DNA sequences of the library hybridize to complementary accessible sequences on the array; (d) removing regulatory DNA sequences of the library that are not bound to accessible sequences on the array; and (e) identifying accessible sequences on the array that are not hybridized to regulatory DNA sequences of the library, wherein the unbound accessible sequences on the array suggest the presence of a SNP in regulatory sequences of the individual corresponding to the unbound accessible sequence.
14. A method for characterizing the effects of a stimulus on a cell, the method comprising: (a) subjecting the cell to the stimulus; (b) isolating a first plurality of polynucleotide sequences from the cell of step (a), whereby the sequences are isolated based on their accessibility in cellular chromatin; (c) optionally amplifying the sequences obtained in step (b); (d) optionally labeling the sequences of step (b) or (c); (e) contacting the sequences of step (b), (c) or (d) with an array according to claim 2; and (f) identifying the accessible sequences bound on the array, thereby identifying sequences that are accessible in the cell.
15. The method of claim 14, further comprising the steps of: (g) providing cells that have not been subjected to the stimulus; (h) isolating a second plurality of polynucleotide sequences from the cell of step (g), whereby the sequences are isolated based on their accessibility in cellular chromatin; (i) optionally amplifying the sequences obtained in step (h); (j) obtaining sequences that are unique to either the first or second plurality of polynucleotide sequences; (k) optionally amplifying the sequences obtained in step (j); (l) optionally labeling the sequences of step (j) or (k); (m) contacting the sequences of step (j), (k) or (l) with an array according to claim 2; and (n) identifying the accessible sequences bound on the array, thereby identifying differences in accessible sequences between cells that have and have not been subjected to the stimulus.